Spatial-temporal human gesture recognition under degraded conditions using three-dimensional integral imaging

Xin Shen; Hee-seung Kim; Komatsu Satoru; Adam Markman; Bahram Javidi

doi:10.1364/OE.26.013938

1. Introduction

Human gesture is one of the elemental actions for human activities, movements, and non-verbal communications [1, 2]. In particular, vision-based gesture detection and recognition has a wide range of applications in human computer interaction, security, surveillance, robotics, medicine, education, etc [3–6]. By capturing human actions through means such as cameras, magnetic field trackers or instrumented gloves, a receiver can recognize and classify the activities using mathematical or computational models [7, 8]. When tracking gestures based on a video sequence, correlation approaches have been investigated as these techniques do not require prior object detection and segmentation. Due to these properties, researchers have been increasingly studying the use of correlation filters for human gesture detection and recognition [9–12].

Human gesture recognition is still a challenging topic in computer science and vision communities. One problem is that features from the gesture and action under degraded conditions may not be fully recorded in the sensing process. For example, the gesture may be partially occluded in a complex scene or there may be poor illumination conditions resulting in degraded image quality. As a result, gesture recognition or detection in degraded conditions is more difficult. Researchers have introduced promising imaging approaches and state-of-the-art algorithms to solve human activity recognition problems [13, 14]. Among various advanced imaging technologies, the three-dimensional (3D) integral imaging technology [15–33] is a promising imaging technique that can capture multiple perspectives of the scene using a micro-lens array or a camera array, and it has been proposed for gesture recognition [34, 35]. By capturing the unique perspectives, a 3D (x, y, z) image can be reconstructed that provides both lateral and depth information. Moreover, the 3D image removes occlusion in front of the object of interest and reduces additive noise present in the scene.

In this paper, we present an integral imaging system based spatial-temporal correlation method to recognize human gesture actions under degraded conditions. The degraded conditions may include occlusion and obscurations as well as low light level in the scene. The system is tested with a variety of classification algorithms such as linear and non-linear distortion-invariant filters; space-time interest points (STIP) feature detector [36] and 3D histogram of oriented gradients (3D HOG) feature descriptor [37, 38] with a standard bag-of-features support vector machine (SVM) classification framework [39], etc. The results with different classification algorithms are compared using a variety of performance metrics. The classification approaches are trained using 3D volume (x, y, t) data for the reconstructed range or longitudinal depth of interest with a series of video image sequences. Test data is acquired using integral imaging passive sensing and computational reconstruction to increase signal-to-noise ratio (SNR) by removing partial occlusion and decreasing noise in the degraded conditions. The total variation (TV) denoising algorithm [40] is then used to reduce noise and preserve edges. The four-dimensional (4D) volume (x, y, t; z) for a fixed reconstruction distance z is then processed for gesture recognition. For the correlation approach, the input data is correlated with the synthesized 3D filter template in the frequency domain. Classification can be performed on the correlation output to detect peaks corresponding to the spatial-temporal location of the gesture. Our optical experiments show that the proposed 4D spatial-temporal integral imaging with nonlinear correlation algorithms is promising in gesture recognition under poor illumination conditions with partial occlusion.

This paper is organized as follows: the integral imaging passive sensing and reconstruction processes are briefly reviewed in Section. 2. The detailed procedures of the linear and non-linear distortion-invariant correlation approach are explained in Section 3. In Section 4, we describe the experimental results for human gesture recognition under degraded conditions using the proposed method and compare the nonlinear correlation approach with previously proposed standard bag-of-features SVM approach. Conclusions are given in Section 5.

2. 3D sensing and reconstruction for degraded conditions using integral imaging

Integral imaging is a passive three-dimensional (3D) sensing and visualization technology that captures both the intensity and directional information of a scene from multiple sensing perspectives. The original concept of integral imaging, which was proposed by Lippmann in [15], uses a lenslet array and a photographic film for 3D sensing [16, 17]. The reconstruction process is the inverse of the sensing, which optically integrates true 3D images from the captured multi-perspective 2D elemental images. Thanks to the rapid development of novel electrical devices and computational technologies, integral imaging has been revived and expanded in recent decades [18–24]. It has been shown that integral imaging is capable of reducing noise in the captured images due to partial occlusion [25], low light illumination [26, 27] and scattering medium [28–30], etc. By replacing the stationary lenslet array with a moving lenslet or camera array, high performance real time 3D optical sensing and reconstruction can be realized by synthetic aperture integral imaging (SAII) [31]. Figure 1 illustrates the camera array capture and reconstruction stages. The computational reconstruction algorithm digitally re-project the captured multi-perspective 2D elemental images into the 3D space within a specific depth range [32]. For this paper, we consider a spatial-temporal volume video data to discuss the reconstruction process:

R e c o n (x, y, z, t) = \frac{1}{O (x, y, t)} \sum_{i = 0}^{K - 1} \sum_{j = 0}^{L - 1} E I^{i, j} (x - i \frac{r_{x} \times p_{x}}{M \times d_{x}}, y - j \frac{r_{y} \times p_{y}}{M \times d_{y}}, t),

where Recon(.) is the integral imaging reconstructed video,

x

and

y

are the pixel index on each video frame, z is the reconstruction depth, and t represents the frame index. The reconstructed 3D frame is obtained by shifting and overlapping the

K \times L

elemental images (EI^i,j) on a specific depth plane. For the shifting parameters,

r_{x}

,

r_{y}

and

d_{x}, d_{y}

indicate the resolution and physical size of the image sensor, respectively;

p_{x}

,

p_{y}

are the pitch of the adjacent image sensors on the camera array; and M = z/f represents the magnification factor which depends on the reconstructed depth and focal length of the camera.

O (x, y, t)

is a matrix which stores the information corresponding to the number of overlapping pixels. The integral imaging computational reconstruction is able to extract the depth information and decrease the noise by averaging the pixels from multi-perspectives. The depth extraction process makes the 3D points appear out of focus if they are not integrated at the in-focus reconstruction plane, which aides in occlusion removal. To further reduce the effect of noise in the reconstructed 3D image, denoising algorithms can be considered for image smoothing using the total variation (TV) denoising algorithms [26, 40].

Fig. 1 Concept of 3D integral imaging: (a) Synthetic aperture integral imaging (SAII) [31], and (b) computational reconstruction of integral imaging [32].

Download Full Size | PDF

3. Linear and non-linear distortion-invariant correlation filters

In this section, we describe the 3D linear and nonlinear distortion-invariant correlation process for spatial-temporal human gesture recognition. The process mainly includes three stages: (1) training the correlation filter, (2) 3D correlation in the frequency domain, and (3) output classification and gesture recognition.

3.1 Optimum linear distortion-invariant correlation filter

As we shall show by experiments in section 4, linear correlation was not successful in gesture recognition tests in degraded conditions. An optimum linear distortion-invariant filter for detecting a distorted target with input noise has been presented in [41]. The input noise can be classified into two categories: (a) overlapping additive noise, and (b) nonoverlapping background noise. The linear distortion-invariant filter is designed by maximizing the output peak-to-output-energy (POE) ratio. In the proposed method, we expand the optimum linear correlation filter into three dimensions for both the spatial and temporal information, which will be used in an integral imaging system for gesture recognition. At the training stage, a series of video data, denoted as $r_{i} (x, y, t)$ , of continuous human gesture actions over time is used to train the filter, where x and y represent the spatial coordinates, t indicates the temporal index of the frames and i𝜖[1, 2,…, N] is the number of training videos. Within the 3D volume, a set of corresponding window functions $w_{r i} (x, y, t)$ is defined for the segmentation between the target and background. The value of the window function is unity in the target area and zero elsewhere. The non-overlapping background noise, denoted as $n_{b} (x, y, t) [w_{0} (x, y, t) - w_{r i} (x, y, t)]$ , is located outside of the target window $w_{r i}$ ( $x, y, t$ ), where $w_{0} (x, y, t)$ is a window function with a unity value in the 3D volume data and zero anywhere else. The overlapping additive noise over the scene can be expressed as $n_{a} (x, y, t) w_{0} (x, y, t)$ . Moreover, it is assumed that n_a(.) and n_b(.) are wide sense stationary random noise. The input training data $s (x, y, t)$ will contain distorted targets which are located at position τ. We first perform a 3D Fourier transform of the training video data to convert it into the frequency domain, $S (u, v, φ) = F T [s (x, y, t)] .$ For simplicity, the 3D matrix of the Fourier transformed data is stacked into a one-dimensional (1D) vector, $S_{v e c} (ω), ω \in ℝ^{U \times V \times Φ} .$ The corresponding vector s(p) of $S_{v e c} (ω)$ in the spatial-temporal domain can be obtained by the inverse Fourier transform:

\begin{array}{l} s (p) = F T^{- 1} [S_{v e c} (ω)] = v e c [s (x, y, t)] \\ = \sum_{i = 1}^{N} a_{i} \cdot r_{i} (p - τ) + n_{b} (p) [w_{0} (p) - \sum_{i = 1}^{N} a_{i} \cdot w_{r i} (p - τ)] + n_{a} (p) \cdot w_{0} (p), \end{array}

where p is the data points on the 1D vector, r_i(.) is the training class videos,

τ

is the position of the target, a_i is 1 when the target [

r_{i} (p - τ

)] exists in the input scene, otherwise a_i is 0, and i = 1, 2, …, N. The optimum linear distortion-invariant filter is synthesized as [41]:

\begin{array}{l} H_{o p t}^{*} (ω) = \frac{E [S (ω, τ) \exp (j ω τ)]}{E [{| S (ω, τ) |}^{2}]} \\ = \frac{\sum_{i = 1}^{N} [R_{i} (ω) + m_{b} W_{1 i} (ω) + m_{a} {| W_{0} (ω) |}^{2} / d]}{\sum_{i = 1}^{N} {{| R_{i} (ω) + m_{b} W_{1 i} (ω) + \frac{m_{a} {| W_{0} (ω) |}^{2}}{d} |}^{2} + \frac{W_{2 i} (ω) * N_{b}^{0} (ω)}{2 π} + \frac{{| W_{0} (ω) |}^{2} * N_{a}^{0} (ω)}{2 π} + {(m_{a} + m_{b})}^{2} [W_{2 i} (ω) - {| W_{1 i} (ω) |}^{2}]}}, \end{array}

where i is the training data set of size N; R_i(ω) and W₀(ω) are the corresponding Fourier transform of the target r_i(p) and the window function w₀(p), respectively;

m_{a}

,

m_{b}

are the mean of

n_{a} (p)

,

n_{b} (p),

respectively.

N_{a}^{0}, N_{b}^{0}

are the power spectrum of [

n_{a} (p) - m_{a}

] and [

n_{b} (p) - m_{b}

], respectively;

W_{1 i} (ω) = [{| W_{0} (ω) |}^{2} / d] - W_{r i} (ω),

W_{2 i} (ω) = {| W_{0} (ω) |}^{2} + {| W_{r i} (ω) |}^{2} - {2 {| W_{0} (ω) |}^{2} R e a l [W_{r i} (ω)] / d}

, and

d = \int^{​} w_{0} (p) d p

, [41].

3.2 Non-linear distortion-invariant correlation filter

A non-linear distortion-invariant correlation filter was designed [42] by the extension of the nonlinear correlator architecture [40]. A series of training video frames, also denoted as $r_{i} (x, y, t)$ , of continuous human gesture actions over time is used to train the filter, where x and y represent the spatial coordinates, t indicates the temporal index of the frames, and i = 1, 2, …, N. We first obtain the vectorized frequency domain version of the training data sets, $R_{i} (ω) = v e c {F T [r_{i} (x, y, t)]}$ . Then a matrix can be generated using the N columns training data, $S^{k} = [R_{1}^{k} (ω), R_{2}^{k} (ω), ..., R_{N}^{k} (ω)]$ in frequency domain. The vector operation $v^{k}$ for a complex vector $v$ is [42]:

v^{k} = {[{| v_{1} |}^{k} \times \exp (j ϕ_{1}), {| v_{2} |}^{k} \times \exp (j ϕ_{2}), ..., {| v_{d} |}^{k} \times \exp (j ϕ_{d})]}^{T},

where (.)^T is the transpose operation. The non-linear distortion-invariant correlation filter in the frequency domain is [42]:

H_{k} (ω) = {S^{k} {({[S^{k}]}^{+} S^{k})}^{- 1} c *}^{1 / k},

where (.)⁻¹ represents matrix inverse operation, (.)⁺ is the complex-conjugate transpose, c is the desired cross-correlation output origin values for each training vector, (.)* is the corresponding complex conjugate vector, and {.}^1/k follows the vector operation in Eq. (4). In another implementation of the non-linear correlation, we simply averaged all the reference template videos

r_{i} (x, y, t)

, to obtain averaged templates

\frac{1}{N} [\sum_{i = 1}^{N} r_{i} (x, y, t)] .

Then, we Fourier transformed the averaged templates and vectorized it to obtain

H (ω)

, and applied non-linear transformation to

H (ω)

as we will show in Eq. (6). This implementation is computationally faster as it requires a single Fourier transform unlike the approach in Eqs. (4) and (5) which requires N Fourier transforms. We shall compare the results with different methods in Section 4.

3.3 Linear and non-linear correlation in the frequency domain

The trained filter $H (ω)$ is then converted into a 3D matrix $H (u, v, φ)$ , and will be correlated with the test data $T (u, v, φ)$ in the frequency domain. $T (u, v, φ)$ is the Fourier transform of the reconstructed 4D test video data t(x, y, t; z) with fixed reconstruction depth z. the correlation output is then obtained by an inverse Fourier transform:

g (x, y, t; z) = F T^{- 1} {{[a b s (H) \cdot a b s (T)]}^{k} \times \exp [j (∠ T - ∠ H)]} ​ ​ ​ ​ ​ ​ ​ . (k \in [0, 1])

In Eq. (6), [.]^k represents an exponential operator. k = 1 corresponds to the linear correlation, otherwise it is a k-th order non-linear correlation process [43]. In Section 4, we will show by experiments that the linear distortion-invariant correlation may not be effective in recognition of gestures in the presence of degraded conditions. Thus, we also consider the non-linear correlation process which is implemented by the k-th order non-linear correlation. The k-th order non-linear correlation filtering is more robust in terms of discrimination. To recognize the human gesture, a threshold-based classification process is performed by analyzing the correlation peak-to-output-energy (POE) ratio to find the peaks of the output 4D matrix $g (x, y, t; z)$ with a fixed z. The POE is defined as the ratio between the square of the expected value of the correlation peak to the expected value of the output-signal energy:

P O E = {| E [g (τ, τ)] |}^{2} / E {\bar{{[g (p, τ)]}^{2}}} ​ ​ ​ ​ ​ ​ ​,

where

g (p, τ)

is the vectorized correlation output, the target is at

τ

, and the overbar in the denominator denotes the normalized integration (spatial averaging) over p. The correlation output of a true class test data should have peaks at the central frame. Note that total variation denoising algorithm [40] is applied to integral imaging reconstructed 4D test video data to further remove noise and enhance the visualization of edges. The flow chart of the proposed human gesture recognition procedure is shown in Fig. 2.

Fig. 2 Flow chart of the proposed 3D correlation method for human gesture recognition. SAII = synthetic aperture integral imaging; FFT = Fast Fourier Transform; Thresh. = threshold. POE = correlation peak-to-output-energy ratio [see Eq. (7)].

Download Full Size | PDF

4. Experimental results

In this section, we demonstrate by experiments the performance of the proposed approach in a low light illumination environment with partial occlusion. As shown in Fig. 3(a), a 9-camera array system is built using a 3 × 3 array along the horizontal and vertical directions for synthetic aperture integral imaging (SAII). The camera array is synchronized to record video data sets with a designed pitch of 80 mm in both of the x and y directions. In the experiment, we use Mako G192C machine vision cameras, with all of the cameras in the camera array having identical intrinsic parameters. The focal length of the camera is 50 mm with F/# of 1.8. The sensor read noise is 20.47 electrons rms/pixel, the quantum efficiency is 0.44 at a wavelength of 525 nm, and a frame rate of 20 frame per second was used. Nine sampled frames from a training template video data with a complete human gesture action in the spatial-temporal domain are shown in Fig. 3(b). In the experiment, five training video data sets are used to synthesize the correlation filters following Eqs. (2-3) or Eqs. (4-5), and each training video data has a total of 33 frames. The training video data sets represent a forefinger with continuous waving motion from left to right when looking at the person. For the test video data, we collected 60 human gestures from 6 participants; and each person performed 5 true class and 5 false class gestures respectively. The true class human gestures are similar to the gestures used in training the correlation filters [see Fig. 3(c)]. However, the false class gestures are various gestures moving in different directions, as shown in Fig. 3(d), which are different than the training gesture action.

Fig. 3 (a) A 3 × 3 camera array for synthetic aperture integral imaging (SAII) in the human gesture recognition experiment. (b) Examples of the training video frames. (c) Example of a true class test video frame. (d) Examples of the false class test gestures.

Download Full Size | PDF

In the experiments, a video sequence containing a human gesture that is partially occluded by plant leaves was captured about 4.5 meters away from the SAII system. Figures 4[a(i-iii)] show three video frames of a captured video sequence under indoor normal illumination conditions. Following Eq. (1), the effect of occlusion is reduced using integral imaging by computationally reconstructing the 3D points at the depth position of the human gesture. Figures 4[b(i-iii)] illustrate three reconstructed frames by integral imaging corresponding to Figs. 4(a). Compared with Figs. 4(a), the front occlusion is substantially reduced in Figs. 4(b), and the parts of the gestures which were not captured directly in the video data can be observed by the integral imaging reconstruction.

Fig. 4 Example of partially occluded human gesture under regular ambient illumination conditions: (a) (i-iii) Three captured 2D elemental images from a single video sequence, and (b) (i-iii) corresponding 4D (x, y, t; z) integral imaging reconstructed frames with occlusion removal at a fixed reconstruction depth z.

Download Full Size | PDF

In another experiment, we consider degraded imaging conditions under low light illumination with partial occlusion for human gesture recognition using the proposed approach. Figures 5[a(i-iii)] show three frames from directly captured video data in the low light environment. With the identical approach under the regular illumination condition, the corresponding reconstructed frames using the SAII system are depicted in Figs. 5[c(i-iii)]. Figures 5[b(i-iii)] show the frames by applying the TV algorithm directly to the captured 2D frames, and Figs. 5[d(i-iii)] are the results by applying the TV algorithm to the integral imaging reconstructed frames. We calculate the signal-to-noise ratio (SNR) of the captured frames [see Figs. 5(a)]. SNR is defined as $S N R = \sqrt{(< g_{o}^{2} > - < n_{o}^{2} >) / < n_{o}^{2} >}$ , where $< g_{o}^{2} >$ is the average power of the object region, and $< n_{o}^{2} >$ is the average power of the noise, which can be obtained from the region of image with no object [26]. The SNR for the captured 2D frame is 0.2356. For photon starved low illumination conditions, the image data is read noise limited and the number of object photons captured per pixel can be further estimated based on the equation:

N_{p h o t o n s} = ϕ t_{e} \approx S N R \times n_{r} / η

where

ϕ

is the incident power of the object,

t_{e}

is the exposure time,

n_{r}

is the camera read noise and

η

is the quantum efficiency. Following Eq. (8), the estimated photons captured per pixel is 10.96 under the degraded condition.

Fig. 5 Image frames for partially occluded human finger gesture [see Fig. 4] under low light illumination conditions: (a) (i-iii) Three separate 2D elemental images from the captured video sequence, (b) (i-iii) 2D elemental images after the total variation (TV) denoising algorithm, (c) (i-iii) integral imaging reconstructed images using 9 perspective video data sets, and (d) (i-iii) integral imaging reconstructed images with the TV denoising algorithm. [a-d (iv)] One dimensional intensity profiles of the finger along the yellow lines in [a-d (ii)].

Download Full Size | PDF

Because of the low illumination conditions and the partial occlusion, the captured video frames have a relatively low dynamic range and it is difficult to observe the object of interest as shown in Figs. 5[a(i-iii)]. Integral imaging reconstruction removes much of the occlusion, as shown in Figs. 5[c(i-iii)], and the shape of the human gesture can be observed. We further apply the TV denoising algorithm to smooth the image and preserve the edges. Comparing Figs. 5(d) with Figs. 5(a)-(c), integral imaging reconstructed frames using the TV denoising algorithm provide enhanced human gesture features. As a reference group, Figs. 5(b) show that the object of interest cannot be visually observed if we directly apply the TV denoising algorithm to the captured 2D frames. Thus, the integral imaging 3D reconstruction with TV denoising algorithm is effective to remove the front occlusion and remedy the effect of low light illumination conditions. Figures 5[a-d (iv)] depict the 1D intensity profile extracted along the yellow lines in Figs. 5[a-d (ii)]. The results indicate improved visualization due to integral imaging and the TV algorithm. In addition, the SNR of the 3D reconstructed frames with TV denoising algorithm [see Fig. 5(d)] is 0.3425, which has approximately a 45% improvement compared with the directly captured 2D frames [see Fig. 5(a)]. The results illustrate that the proposed approach may provide enhanced image quality for 3D visualization under degraded condition. The trained correlation filters are then multiplied with the Fourier transformed test video volume in the frequency domain following Eq. (6), with k set as 0.3 for the best performance of the non-linear correlation process.

To evaluate the performance of the linear distortion-invariant classifier [see Section 3.1] on the experimental data, we generate five Receiver Operating Characteristic (ROC) curves using the (a) linear correlation process (k = 1) for integral imaging reconstructed video data with the TV algorithm, [InIm + TV], and (b) distortion-invariant filter with a non-linear operation (k = 0.3) for (i) captured 2D video data sets, [EI]; (ii) the 2D video data after the TV algorithm, [EI + TV]; (iii) integral imaging reconstructed video data sets, [InIm]; and (iv) integral imaging reconstructed video data with the TV algorithm, [InIm + TV]. We collected 60 human gestures data with 30 true class and 30 false class gesture videos. By adjusting the threshold for the output POE values for both the true and false class correlation outputs, the ROC curves are generated between the five groups. In Fig. 6, the ROC curve corresponds to the linear correlation process (k = 1) with an Area Under the Curve (AUC) of 0.539 [green dashed line], which indicates that the linear distortion-invariant correlation is not effective in recognition of gestures in the presence of our degraded condition experiments. With a non-linear correlation approach (k = 0.3), the ROC curve for the integral imaging reconstructed video with the TV algorithm [InIm + TV, red solid line] has an AUC of 0.897. This is higher than the ROC curves of the other 3 aforementioned data sets, which have AUC of 0.497 for the captured 2D video [EI, black solid line], AUC of 0.557 for the 2D video data with the TV algorithm [EI + TV, magenta dashed line], and AUC of 0.766 for the integral imaging reconstructed video [InIm, blue dash-dotted line].

Fig. 6 ROC (Receiver operating characteristic) curves for human gesture recognition using optimum linear distortion-invariant filter, and non-linear transformations of the filter in the presence of degraded conditions. (a) For the linear correlation process, k = 1, integral imaging reconstructed video with the TV algorithm (InIm + TV, green dashed line). (b) For non-linear correlation process, k = 0.3 [see Eq. (6)], (i) captured 2D video data (EI, black solid line), (ii) captured 2D video with the TV algorithm (EI + TV, magenta dashed line), (iii) integral imaging reconstructed video (InIm, blue dash-dotted line), and (iv) integral imaging reconstructed video with the TV algorithm (InIm + TV, red solid line).

Download Full Size | PDF

We also evaluate the performance of the non-linear distortion-invariant correlator classifier [see Section 3.2] on the experimental data, by generating five Receiver Operating Characteristic (ROC) curves: (i) captured 2D video data sets, [EI] with Eq. (5); (ii) the 2D video data with the TV algorithm, [EI + TV] with Eq. (5); (iii) integral imaging reconstructed video data sets, [InIm] with Eq. (5); (iv) integral imaging reconstructed video data with the TV algorithm, [InIm + TV] with Eq. (5); and (v) integral imaging reconstructed video data with the TV algorithm, where the filter is trained by the averaged template videos as described at the end of Section 3.2 [InIm + TV, averaged template]. As shown in Fig. 7, the ROC curve for the integral imaging reconstructed video with the TV algorithm [InIm + TV, red solid line] has an AUC of 0.921, which is significantly higher than the ROC curves of the other 3 aforementioned data sets, which have AUC of 0.443 for captured 2D video [EI, black solid line], AUC of 0.642 for the 2D video data after the TV algorithm [EI + TV, magenta dashed line], and AUC of 0.803 for the integral imaging reconstructed video [InIm, blue dash-dotted line]. In addition, the AUC of integral imaging reconstructed video with the TV algorithm [InIm + TV, brown dotted line] using the correlation filter trained by the averaged template videos, as described at the end of Section 3.2, is 0.887. Thus, while this non-linear correlator is simple to implement computationally, its performance is quite good in our experiments. Our experimental results demonstrate the potential of using 4D spatial-temporal integral imaging and TV denoising algorithms with non-linear correlation classification for enhanced performance of the human gesture recognition in the presence of degraded conditions.

Fig. 7 ROC (Receiver operating characteristic) curves for human gesture recognition using non-linear distortion-invariant filter in the presence of degraded conditions with k = 0.3, [see Eq. (6)], (i) captured 2D video data (EI, black solid line), (ii) captured 2D video with the TV algorithm (EI + TV, magenta dashed line), (iii) integral imaging reconstructed video (InIm, blue dash-dotted line), (iv) integral imaging reconstructed video with the TV algorithm (InIm + TV, red solid line); and (v) integral imaging reconstructed video with the TV algorithm, where the correlation filter is trained by the averaged template videos as described at the end of Section 3.2 (InIm + TV, brown dotted line).

Download Full Size | PDF

In addition, the degraded conditions may affect the performance of feature points detection. We performed human gesture recognition using an alternative approach as previously reported with standard bag-of-features support vector machine [SVM] framework [39] with the identical integral imaging reconstructed video data sets under low illumination conditions and in the presence of occlusion. The main steps of this approach [39] include (1) using a space-time interest points (STIP) feature detector to extract spatial-temporal points, (2) describing the detected features using 3D histogram of oriented gradients (3D HOG) feature descriptor, (3) quantizing the descriptors by K-means clustering, obtaining the bag-of-words (BoW) representation for each video, and (4) SVM for training the BoW and classification. We generated a confusion matrix to compare the human gesture classification performance under degraded conditions with low light illumination environment and occlusion for the integral imaging sensing system, and various classification algorithms such as distortion-invariant correlation approaches; the standard bag-of-features SVM frame with STIP and 3D HOG for the human gesture recognition. The thresholds for classification of the correlation-based approaches are calculated based on the optimal operating points obtained from the corresponding ROC curves as the red circles shown in Figs. 6 and 7. As illustrated in Table. 1, the confusion matrix shows that the non-linear distortion-invariant correlation-based approaches perform well for the human gesture recognition under degraded conditions and have significantly higher sensitivity and accuracy than the approach using STIP, 3D HOG with SVM.

Table 1. Confusion matrix between a variety of algorithms for human gesture recognition and classification TP = True Positive, TN = True Negative, FP = False Positive, FN = False Negative, Tot. = Total

View Table

Note that the integral imaging system includes multiple cameras. One of the limitations is the computational load for the 4D video data sets. We may consider using graphics processing units (GPU) and parallel processing to speed up the computational time. In this work we fixed the reconstructed depth (z) for object recognition (x, y, t; z). We may apply depth estimation approach to generate 4D data (x, y, z, t) for human gesture recognition. In the future work, we plan to continue research on human activity detection and recognition (1) for various degraded conditions through heavily scattering medium such as under fog; (2) Use different 3D integral imaging systems such as axially distributed sensing (ADS) and flexible sensing integral imaging; (3) We will also further develop human gesture recognition algorithms for the specific degraded conditions.

5. Conclusion

In this paper, we have presented for the first time a 4D spatial-temporal recognition method using integral imaging and correlation filters for automated human gesture recognition under degraded conditions in the presence of partial occlusions and low light illumination. Our experiments indicate that conventional imaging methods may not be able to obtain enough features from the degraded scene to successfully recognize the gestures. The spatial-temporal gestures are processed using a variety of algorithms such as linear and non-linear spatial-temporal correlation; space-time interest points feature detector and 3D histogram of oriented gradients feature descriptor with support vector machine, etc. The results are compared using a variety of performance metrics. Experimental results are presented to verify the feasibility and the advantages of the proposed approach. The proposed method may be expanded in future work for human activity detection and recognition under various degraded conditions, using different integral imaging systems and classification algorithms.

Funding

Office of Naval Research (ONR) (N00014-17-1-2561).

Acknowledgments

Xin Shen would like to acknowledge the United Technologies Corporation Institute for Advanced Systems Engineering (UTC-IASE) at the University of Connecticut.

References and links

1. K. R. Gibson and T. Ingold, Tools, language and cognition in human evolution (Cambridge University Press, 1994).

2. S. Mitra and T. Acharya, “Gesture Recognition: A Survey,” IEEE Trans. Syst. Man Cybern. C 37(3), 311–324 (2007). [CrossRef]

3. T. B. Moeslund and E. Granum, “A survey of computer vision-based human motion capture,” Comput. Vis. Image Underst. 81(3), 231–268 (2001). [CrossRef]

4. J. Aggarwal and M. Ryoo, “Human activity analysis: A review,” ACM Comput. Surv. 43, 1–43 (2011).

5. D. Weinland, R. Ronfard, and E. Boyer, “A survey of vision-based methods for action representation, segmentation and recognition,” Comput. Vis. Image Underst. 115(2), 224–241 (2011). [CrossRef]

6. J. M. Chaquet, E. J. Carmona, and A. Fernández-Caballero, “A survey of video datasets for human action and activity recognition,” Comput. Vis. Image Underst. 117(6), 633–659 (2013). [CrossRef]

7. C. R. Wren, A. Azarbayejani, T. Darrell, and A. P. Pentland, “Pfinder: Real-time tracking of the human body,” IEEE Trans. Pattern Anal. 19(7), 780–785 (1997). [CrossRef]

8. Y. Wu and T.S. Huang, “Vision-based gesture recognition: a review,” Lect. Notes Comput. 1739, 103-115 (1999).

9. F. A. Sadjadi and A. Mahalanobis, “Target-adaptive polarimetric synthetic aperture radar target discrimination using maximum average correlation height filters,” Appl. Opt. 45(13), 3063–3070 (2006). [CrossRef] [PubMed]

10. S. Ali and S. Lucey, “Are correlation ﬁlters useful for human action recognition?” in ICPR, IEEE, 2608–2611 (2010).

11. A. Mahalanobis, R. Stanfill, and K. Chen, “A Bayesian approach to activity detection in video using multi-frame correlation filters,” Proc. SPIE 8049, 8049OP (2011).

12. H. Kiani, T. Sim, and S. Lucey, “Multi-channel correlation filters for human action recognition,” Image Processing (ICIP), IEEE International Conference, 1485–1489, (2014). [CrossRef]

13. L. Chen, W. Hong, and J. Ferryman, “A survey of human motion analysis using depth imager,” Pattern Recognit. Lett. 34(15), 1995–2006 (2013). [CrossRef]

14. L. L. Presti and M. La Cascia, “3D skeleton-based human action classification: A survey,” Pattern Recognit. 53, 130–147 (2016). [CrossRef]

15. G. Lippmann, “Épreuves réversibles donnant la sensation du relief,” J. Phys. Theoretical Appl. 7(1), 821–825 (1908). [CrossRef]

16. H. E. Ives, “Optical properties of a Lippmann lenticuled sheet,” J. Opt. Soc. Am. 21(3), 171–176 (1931). [CrossRef]

17. C. B. Burckhardt, “Optimum parameters and resolution limitation of integral photography,” J. Opt. Soc. Am. 58(1), 71–76 (1968). [CrossRef]

18. Y. Igarishi, H. Murata, and M. Ueda, “3D display system using a computer-generated integral photograph,” Jpn. J. Appl. Phys. 17(9), 1683–1684 (1978). [CrossRef]

19. T. Okoshi, “Three-dimensional displays,” Proc. IEEE 68(5), 548–564 (1980). [CrossRef]

20. J. Arai, F. Okano, H. Hoshino, and I. Yuyama, “Gradient-index lens-array method based on real-time integral photography for three-dimensional images,” Appl. Opt. 37(11), 2034–2045 (1998). [CrossRef] [PubMed]

21. H. Hoshino, F. Okano, H. Isono, and I. Yuyama, “Analysis of resolution limitation of integral photography,” J. Opt. Soc. Am. A 15(8), 2059–2065 (1998). [CrossRef]

22. F. Okano, J. Arai, K. Mitani, and M. Okui, “Real-time integral imaging based on extremely high resolution video system,” Proc. IEEE 94(3), 490–501 (2006). [CrossRef]

23. X. Xiao, B. Javidi, M. Martinez-Corral, and A. Stern, “Advances in three-dimensional integral imaging: Sensing, display, and applications,” Appl. Opt. 52(4), 546–560 (2013). [CrossRef] [PubMed]

24. B. Javidi, X. Shen, A. S. Markman, P. Latorre-Carmona, A. Martínez-Uso, J. M. Sotoca, F. Pla, M. Martínez-Corral, G. Saavedra, Y. P. Huang, and A. Stern, “Multidimensional Optical Sensing and Imaging System (MOSIS): From Macroscales to Microscales,” Proc. IEEE 105(5), 850–875 (2017). [CrossRef]

25. B. Javidi, R. Ponce-Díaz, and S. H. Hong, “Three-dimensional recognition of occluded objects by using computational integral imaging,” Opt. Lett. 31(8), 1106–1108 (2006). [CrossRef] [PubMed]

26. A. Stern, D. Aloni, and B. Javidi, “Experiments with three-dimensional integral under low light levels,” IEEE Photonics J. 4(4), 1188–1195 (2012). [CrossRef]

27. A. Markman, X. Shen, and B. Javidi, “Three-dimensional object visualization and detection in low light illumination using integral imaging,” Opt. Lett. 42(16), 3068–3071 (2017). [CrossRef] [PubMed]

28. I. Moon and B. Javidi, “Three-dimensional visualization of objects in scattering medium by use of computational integral imaging,” Opt. Express 16(17), 13080–13089 (2008). [CrossRef] [PubMed]

29. Y.-K. Lee and H. Yoo, “Three-dimensional visualization of objects in scattering medium using integral imaging and spectral analysis,” Opt. Lasers Eng. 77, 31–38 (2016). [CrossRef]

30. M. Cho and B. Javidi, “Peplography-a passive 3D photon counting imaging through scattering media,” Opt. Lett. 41(22), 5401–5404 (2016). [CrossRef] [PubMed]

31. J.-S. Jang and B. Javidi, “Three-dimensional synthetic aperture integral imaging,” Opt. Lett. 27(13), 1144–1146 (2002). [CrossRef] [PubMed]

32. S. H. Hong, J. S. Jang, and B. Javidi, “Three-dimensional volumetric object reconstruction using computational integral imaging,” Opt. Express 12(3), 483–491 (2004). [CrossRef] [PubMed]

33. H. Yoo, “Artifact analysis and image enhancement in three-dimensional computational integral imaging using smooth windowing technique,” Opt. Lett. 36(11), 2107–2109 (2011). [CrossRef] [PubMed]

34. V. Javier Traver, P. Latorre-Carmona, E. Salvador-Balaguer, F. Pla, and B. Javidi, “Human gesture recognition using three-dimensional integral imaging,” J. Opt. Soc. Am. A 31(10), 2312–2320 (2014). [CrossRef] [PubMed]

35. V. Javier Traver, P. Latorre-Carmona, E. Salvador-Balaguer, F. Pla, and B. Javidi, “Three-dimensional Integral Imaging for Gesture Recognition under Occlusions,” IEEE Signal Process. Lett. 22(2), 171–175 (2017). [CrossRef]

36. I. Laptev, “On space-time interest points,” Int. J. Comput. Vis. 64(2–3), 107–123 (2005). [CrossRef]

37. A. Klaser, M. Marszałek, and C. Schmid, “A spatio-temporal descriptor based on 3d-gradients,” in BMVC 19th British Machine Vision Conference (2008), pp. 275.

38. R. Dupre, V. Argyriou, D. Greenhill, and G. Tzimiropoulos, “A 3D Scene Analysis Framework and Descriptors for Risk Evaluation,” 2015 International Conference on 3D Vision, Lyon, 100–108 (2015). [CrossRef]

39. H. Wang, M. M. Ullah, A. Klaser, I. Laptev, and C. Schmid, “Evaluation of local spatio-temporal features for action recognition,” In BMVC 2009-British Machine Vision Conference 124–1 (2009). [CrossRef]

40. L. I. Rudin, S. Osher, and E. Fatemi, “Nonlinear total variation based noise removal algorithms,” Physica D 60(1–4), 259–268 (1992). [CrossRef]

41. B. Javidi and J. Wang, “Optimum distortion-invariant filter for detecting a noisy distorted target in nonoverlapping background noise,” J. Opt. Soc. Am. A 12(12), 2604–2614 (1995). [CrossRef]

42. B. Javidi and D. Painchaud, “Distortion-invariant pattern recognition with Fourier-plane nonlinear filters,” Appl. Opt. 35(2), 318–331 (1996). [CrossRef] [PubMed]

43. B. Javidi, “Nonlinear joint power spectrum based optical correlation,” Appl. Opt. 28(12), 2358–2367 (1989). [CrossRef] [PubMed]

		Actual condition
		Linear correlation filter used with nonlinear architecture, k = 0.3			kth-order nonlinear distortion-invariant filter, k = 0.3			STIP, 3D HOG with Bag-of-features SVM framework
		True	False	Tot.	True	False	Tot.	True	False	Tot.
Classified condition	True	TP = 25	FP = 3	28	TP = 28	FP = 7	35	TP = 13	FT = 8	21
	False	FN = 5	TN = 27	32	FN = 2	TN = 23	25	FN = 17	TN = 22	39
	Tot.	30	30	60	30	30	60	30	30	60
Sensitivity / TP rate TP/(TP + FN)		83.3%			93.3%			43.3%
Specificity / TN rate TN/(TN + FP)		90%			76.7%			73.3%
Accuracy / ACC (TP + TN)/(TP + FN + FP + TN)		86.7%			85%			58.3%

Spatial-temporal human gesture recognition under degraded conditions using three-dimensional integral imaging

Abstract

1. Introduction

2. 3D sensing and reconstruction for degraded conditions using integral imaging

3. Linear and non-linear distortion-invariant correlation filters

3.1 Optimum linear distortion-invariant correlation filter

3.2 Non-linear distortion-invariant correlation filter

3.3 Linear and non-linear correlation in the frequency domain

4. Experimental results

5. Conclusion

Funding

Acknowledgments

References and links

Cited By

Figures (7)

Tables (1)

Equations (8)

Optics Express