Spatio-temporal continuous gesture recognition under degraded environments: performance comparison between 3D integral imaging (InIm) and RGB-D sensors

Gokul Krishnan; Yinuo Huang; Rakesh Joshi; Timothy O’Connor; Bahram Javidi

doi:10.1364/OE.438110

1. Introduction

Gestures serve as a natural communication modality between humans and automated gesture recognition has recently gained interest due to its wide range of applications in human-computer interaction systems, sign language recognition, patient monitoring, security, entertainment, robotics, etc [1–4]. In general, hand gesture recognition can be broadly classified into three groups: 1) static 2) trajectory and 3) continuous recognition. Static gesture recognition considers a still image representing each gesture, while in trajectory-based gesture recognition the hand trajectory is also taken into account for extracting spatio-temporal features for gesture recognition. Previously, we have reported spatio-temporal trajectory-based gesture recognition using 3D Integral imaging and deep neural networks [3]. As compared to static and trajectory approaches, the continuous gesture case consists of a series of similar/dissimilar gestures rather than a single gesture [2,4,5]. Segmenting the individual gestures from the continuous video stream can be challenging since the gestures can happen in an arbitrary order and their duration is generally unknown [4]. Numerous approaches have been proposed for temporal segmentation and recognition of gestures from continuous gesture sequences including the use of Dynamic Time Warping (DTW) [6], Hidden Markov Models [7], continuous dynamic programming [8] as well as approaches based on deep neural networks [9], etc.

Many state-of-the-art 3D continuous gesture recognition methodologies are based on RGB-D sensor-based data acquisition and consider non-occluded cases. However, environmental degradations such as partial occlusion, adverse illumination effects etc. can occur, which makes the temporal segmentation and recognition in the continuous case even more challenging. 3D integral imaging-based computational reconstruction provides an efficient way of removing partial occlusions and thereby aids in gesture detection under such degradations [10,11]. Therefore, in this manuscript, we leverage the advantages of passive 3D integral imaging with deep neural networks to propose a continuous gesture recognition system that is more robust under degraded environments such as partial occlusion. We have also provided a performance comparison between two popular 3D data acquisition techniques used for continuous gesture recognition: 1) 3D integral imaging and 2) RGB-D sensors. Our experimental results suggest that 3D integral imaging-based approach may be more robust for spatio-temporal continuous gesture recognition as compared to the RGB-D sensor-based data acquisition, especially for the degraded conditions considered in this manuscript.

3D integral imaging-based computational reconstruction is used to reconstruct depth-segmented gestures at its corresponding depth. The reconstructed data is fed to a single stage detector (i.e. the You Look Only Once version 2.0 (YOLOv2) [12]). The bounding box coordinates and the detection probabilities obtained from the YOLOv2 detector are used for the spatio-temporal segmentation of the gestures from the continuous gesture sequence. The spatio-temporally segmented gesture sequences are finally fed to a CNN-BiLSTM network for classification. This approach enhances the overall gesture recognition performance by multiple means: 1) it performs gesture recognition using a continuous gesture sequence video, and 2) the spatial segmentation improves the classification accuracy by isolating the gesture from other objects in the 3D scene and thereby improving the recognition capabilities.

This paper is organized into four different sections: Section 1 provides the introduction and a brief review of the current state of gesture recognition. Section 2 discusses the proposed approach and its details. Section 3 deals with the experimental results, including the performance of the proposed system, and comparison with 2D imaging and Kinect-based methodologies. Finally, Section 4 provides the conclusions of the paper.

2. Methodology

In this section, we discuss the technical aspects of the proposed spatio-temporal continuous gesture recognition approach in more detail. The block diagram for the proposed approach is shown in Fig. 1. Using 3D integral imaging, the reconstructed gesture at the depth of the gesture of interest is fed to a YOLOv2 detection framework. The bounding box coordinates and the confidence score obtained from the YOLOv2 detector have been used for spatio temporal segmentation of the continuous gesture sequence. In order to mitigate the effect of noisy detection scores, we have used a Savitzky-Golay smoothening filter. The spatio temporal segmented videos are fed to a CNN-based BiLSTM classifier for gesture classification.

Fig. 1. Block diagram of the proposed 3D integral imaging (InIm) based continuous gesture recognition approach.

Download Full Size | PDF

2.1 3D Integral Imaging based computational reconstruction

Integral imaging is a 3D imaging procedure that simultaneously captures both intensity and directional information of a 3D scene using a lenslet array, camera array or a moving camera framework [13–23]. Initially proposed by Lippmann [18], 3D integral imaging-based techniques have proved to be useful for human action and gesture recognition [2,3,11], reducing noise due to partial occlusion [17], low-light imaging [19, 20], and scattering medium [21–23], etc. The computational reconstruction algorithm can reconstruct the gesture at the plane of interest by back projecting the elemental images into the object space through a virtual pinhole array [14]. Using integral imaging-based computational reconstruction algorithm, the spatio-temporal gesture video has been reconstructed at the reconstruction depth z as follows:

(1)$$r\left( {x,y,z;t} \right) = \frac{1}{{O\left( {x,y;t} \right)}}\mathop \sum \nolimits_{i = 0}^{K - 1} \mathop \sum \nolimits_{j = 0}^{L - 1} E{I_{i,j}}\left( {x - i\frac{{{r_x} \times {p_x}}}{{M \times {d_x}}},\; y\; - j\frac{{{r_y} \times {p_y}}}{{M \times {d_y}}};t} \right)$$

where, $r({x,y,z;t} )$ is the integral imaging reconstructed video obtained by shifting and overlapping K × L elemental images at the desired reconstruction depth . The $\textrm{E}{\textrm{I}_{\textrm{i},\textrm{j}}}$ represents the $\textrm{i},\textrm{j}$^th elemental image, and x,y represents their pixel indices. Here, $t$ depicts the frame index of the video. The magnification factor is $M = \; \frac{z}{f}$, where f is the focal length. In Eq. (1), ${\textrm{p}_\textrm{x}}$, ${\textrm{p}_\textrm{y}}$ . indicates the pitch of the adjacent image sensors while ${\textrm{r}_\textrm{x}},$ ${\textrm{r}_\textrm{y}}$ and ${\textrm{d}_\textrm{x}}$, ${\textrm{d}_\textrm{y}}$ represent the resolution and physical size of the image sensor, respectively on the camera array. The matrix $\textrm{O}({\textrm{x},\textrm{y};\textrm{t}} )$ holds information regarding the number of overlapping pixels. The integral imaging camera pickup process and computational reconstruction has been depicted in Fig. 2 (a) and (b), respectively.

2.2 Gesture detection using YoloV2 neural network model

The 3D reconstructed video images obtained using the integral imaging-based computational reconstruction algorithm using Eq. (1) is fed to a YOLOv2 neural network model for detecting the presence of gestures. The YOLO network is a unified object detector, which learn simultaneously to both localize and classify the object of interest in a single step providing a direct mapping from image pixels to bounding box coordinates with their class probabilities as compared to the two-stage region proposal based frameworks such as R-CNN [12, 24, 25, 26]. It has been successfully used for 3D object detection and classification [27]. In the YOLO detection framework, an input image is divided into an $\textrm{S} \times \textrm{S}$ grid, and each grid cell predicts B bounding boxes along with their corresponding confidence scores. For our implementation, we have used $\textrm{S} = 7$ and $\textrm{B} = 2$. The confidence score c is defined as $\textrm{c} = \textrm{P}({\textrm{Obj}} )\times \textrm{IOU}_{\textrm{pred}}^{\textrm{truth}},\textrm{P}({\textrm{Obj}} )> 0$, where $\textrm{P}(\textrm{Obj})$ indicates how likely the object is present in the image and $\textrm{IOU}_{\textrm{pred}}^{\textrm{truth}}$ is the Intersection over Union between the ground truth and the predicted bounding box which indicates the confidence of its prediction. During test time, the class specific confidence score for each bounding box can be obtained as:

(2)$$\begin{aligned} &\textrm{P}({\textrm{Obj}} )\times \textrm{IOU}_{\textrm{pred}}^{\textrm{truth}} \times \textrm{P}\left( {\frac{{\textrm{Clas}{\textrm{s}_\textrm{i}}}}{{\textrm{Obj}}}} \right),\textrm{i} = 1,2,3, \ldots ,\textrm{C} \\ &= \textrm{P}({\textrm{Clas}{\textrm{s}_\textrm{i}}} )\times \textrm{IOU}_{\textrm{pred}}^{\textrm{truth}} \end{aligned}$$

where $\textrm{C}$ indicates the number of classes. The optimal parameters for the network we have been obtained by using the loss function presented in [25]. In practice, many grid cells do not contain an object and the confidence scores of those cells are pushed towards zero. This can result in overpowering of gradients from these cells and may lead to instability. To handle this, the loss from bounding box coordinate predictions is given a higher weight as compared to the loss from boxes that do not contain an object. The network architecture of the YOLOv2 detector is shown in Fig. 3. For the feature extraction module in Fig. 3, we have used the output of a convolutional neural network. The input size of the network is 224×224×3. More detailed descriptions regarding the feature extraction module used in our YOLOv2 implementation has been provided in the Appendix A2. After extracting the features, the output is then fed to a sequentially connected convolution, batch normalization and Rectified Linear Unit (ReLU) layers to form a detection subnetwork which is repeated N times. For our implementation, we set N = 2. Thus, the detection subnetwork consists of 6 layers with the convolution layer performing 3×3 convolutions with stride [1 1]. The activations of the final convolution layer are fed to the transform layer. This layer converts the predictions to fall within the bounds of the ground truth and the output layer the implements the loss function which is needed for training the model.

Fig. 2. (a) Integral imaging passive sensing and image pickup process, (b) computational 3D image and depth reconstruction process using integral imaging.

Download Full Size | PDF

Fig. 3. Network architecture of the YOLOv2 detection framework used for gesture detection. For our implementation, we have used N = 2.

Download Full Size | PDF

YOLOv2 detection network provides improved detection capabilities as compared to the original YOLO framework by adopting strategies such as batch normalization, high resolution classifier, convolution with anchor boxes, multi-scale training, etc. [12]. Additionally, the network changes the image dimensions after every few iterations to aid in learning to predict across a variety of scales. The network outputs the bounding box coordinates along with their confidence scores. In this work, the bounding box coordinates are used for spatial localization, and the confidence scores are used for temporal segmentation. We have used stochastic gradient descent with momentum (sgdm) optimizer with a mini-batch size of 8 and 20 training epochs for 3D InIm. For 2D imaging, we have used a minibatch size and training epochs of 4 and 10, respectively. For Kinect RGB we have used a minibatch size of 2 and trained for 20 epoch whereas for the Kinect depth sensor we used a mini-batch size of 8 and trained for 10 epochs. The learning rate has been set to$\; {10^{ - 3}}$. These parameters were chosen by hyper parameter tuning of each individual network using a validation dataset separate from the training and testing datasets.

2.3 Spatio-temporal continuous gesture segmentation

The output of the detector, namely the bounding box coordinates and their corresponding confidence scores, are used for spatio-temporal segmentation of continuous gesture sequences. Due to the presence of degradations such as occlusion, several misdetections can happen leading to “noisy” confidence scores which degrade the spatio-temporal segmentation performance. To cope with this problem, we apply a non-linear transformation and temporal smoothing of the confidence scores and used the transformed smooth confidence score for segmentation rather than using the original “noisy” scores. Let $\tilde{{\textrm s}}(n)$ be the original confidence score output from the detector with n as the time index. We apply the following non-linear transformation to $\tilde{{\textrm s}}(n)$ to get $\textrm{s}(\textrm{n} )$, $s(n )= {h_h}$ if $\tilde{s}(n )\ge threshold,$ else $s(n )= {h_l}$. We have used ${\textrm{h}_\textrm{h}} = 1,$ ${\textrm{h}_\textrm{l}} = 0$ and $\textrm{threshold} = 0.5$ for our analysis. The transformed score $\textrm{s}(\textrm{n} )$ is further subjected to the Savitzky-Golay smoothing filter and the non-linear thresholding-based transformation as mentioned above to get the “smoothed” confidence score which will be used for temporal segmentation. The Savitzky-Golay smoothing filter is a polynomial smoothing procedure which smooths the data by fitting a lower degree polynomial over successive subsets of adjacent data points using least squares and was chosen for its various desirable properties such as peak shape preservation in the signal or the time series [28]. The optimal coefficient vector for Savitzky-Golay smoothing filter has been obtained using the least squares approach [29]. After obtaining the smoothed detection score, the starting and ending of the isolated gestures can be obtained from the difference vector derived using the difference of successive points in the smoothed detection scores. In cases of points where the bounding boxes were absent, the spatial segmentation of the gesture has been done by approximating the bounding box of the current point as the previous bounding box coordinates.

2.4 Gesture classification using CNN-BiLSTM network

Finally, we have used a CNN-BiLSTM network for classifying the gestures after segmentation. The CNN-BiLSTM neural network framework involves feature extraction using a pretrained convolutional network, then feeding the feature vectors from the segmented video frames to a bidirectional long-short term memory network in order to capture the temporal pattern of the data. For feature extraction, in order to mitigate the effects of lesser training data, we have used the pretrained CNN, trained on the ImageNet dataset [30]. For the pretrained CNN model, we have used a GoogLeNet network [31]. For each frame of the 3D reconstructed video, the pretrained GoogLeNet network outputs a feature vector as a representation of the spatial information content of the video. Then the output feature vectors of subsequent frames are concatenated to produce a feature matrix with each row representing a feature and each column representing a time point in the underlying segmented video data. We have used a BiLSTM network since it outperforms unidirectional LSTM networks [32,33]. The BiLSTM network is a variant of LSTM network [34] consisting of two separate networks, one responsible for learning in forward time direction while the other network for learning in the backward time direction. The network responsible for learning in the forward time direction outputs ${\textrm{h}_{\textrm{forward}}}$ while the network in the reverse time direction outputs ${\textrm{h}_{\textrm{reverse}}}$ Their outputs are concatenated i.e., $\textrm{h} = [{\textrm{h}_{\textrm{forward}}},{\textrm{h}_{\textrm{reverse}}}]$ and is fed to a fully connected layer followed by a softmax and classification layer.

In case of unidirectional LSTMs, the hidden vector can be obtained from Eq. (3) - (7) [34,35]. Figure 4 shows the network architecture used for the classification network. More detailed description regarding the CNN-BiLSTM layers used for the classification network has been provided in Appendix A2. For our implementation, the recurrent weights of the network are randomly initialized from a unit normal distribution. The hyperbolic tangent (tanh) function and the sigmoid functions have been used for the state activation and the gate activation, respectively. We have used adam optimizer with a mini-batch size of 4 and 30 training epochs for 3D InIm. For 2D imaging a mini-batch size of 8 and 40 training epochs were used. The Kinect RGB sensor data used a minibatch size of 32 with 30 training epochs. For the Kinect depth sensor, we have used a minibatch size of 4 and training epochs of 30. We have used BiLSTM layer with 100 hidden units for 3D InIm and Kinect depth sensor, while for 2D imaging and Kinect RGB, we used BiLSTM layer with 200 hidden units. We set the learning rate to 10⁻⁴. As with the neural network-based detector, these parameters have been obtained from hyper parameter tuning using the validation dataset which is separate from the training and the testing datasets.

Fig. 4. Architecture for the CNN-BiLSTM classification framework. CNN: - Convolutional Neural Network, Bi-LSTM: - Bi-directional long short term memory network.

Download Full Size | PDF

3. Experimental results and discussions

In this section, we analyze the performance of the 3D InIm based approach and compare it with that of 2D imaging, and RGB-D sensor. For our experiments, we used a 3×3 camera array with Mako G192C cameras having identical intrinsic parameters. The camera pixel size is 4.5 µm × 4.5 µm and the pitch of the camera array was designed to be 80 mm in both x and y directions. The focal length of each camera is set to be 15 mm and the sensor dimensions are 1200 × 1600 pixels. The quantum efficiency of the camera is 0.44 at a wavelength of 525 nm. The sensor read noise is 20.47 electrons rms/pixel. All the cameras in the array are synchronized and the data is recorded at a frame rate of 10 frames per second (fps). The exposure time is set to be 30 milliseconds (ms). For the RGB-D sensor, we have used Azure Kinect DK for data collection. The resolution for the RGB camera is 1920 × 1080, while for the depth sensor the resolution is 640 × 576. The data has been collected with Narrow Field of View (“NFOV unbinned”) mode of operation with the Kinect. The data is initially recorded with a frame rate of 30 fps and converted to 10 fps for a fair comparison with integral imaging capture set up. In addition, the field of view and the resolution of the camera array and the Kinect are different, therefore cropping and resizing [11] has been done so that the images have comparable effective resolution for the region of interest.

For our experiments, we have considered three classes of gestures as depicted in Fig. 5, The data were collected from 6 participants with 5 different backgrounds, at a distance of about 2 meters from the camera array and Kinect as shown in Fig. 5. We have collected data under two different conditions: 1) isolated gesture without any degradation such as partial occlusion, and 2) continuous gesture sequence under partial occlusion. The depth of the gesture of interest is assumed to be known a priori for simplicity. This assumption is for convenience and may not be necessary as the integral imaging can reconstruct the in-focus gesture of interest in the 3D scene and segment out 3D objects in the scene. In the case of isolated gestures, the participants were asked to repeat the gesture twice for each background in order to capture the fast and slow variations of the gesture. Each continuous gesture videos consists of two different gestures happening at random intervals and duration within the video. From Fig. 6, we can see that the integral imaging computational reconstruction reduces the effect of partial occlusion thereby enhances the performance as compared with other imaging modalities.

Fig. 5. a) 3 × 3 camera array for integral imaging capture stage used for our experiments. (b) depicts the three different gesture motions considered in this paper. (c) shows a single 3D integral imaging reconstructed gesture with different scene backgrounds used for our experiments. Integral imaging segments out the gesture of interest from the background.

Download Full Size | PDF

Fig. 6. a) Sample video frames from integral imaging (3D InIm), 2D imaging, Kinect RGB (Kin. RGB), and Kinect depth (Kin. depth) used for training neural network. (b) sample video frames from 3D InIm, 2D imaging, Kinect RGB and Kinect depth used for testing the gesture recognition approach. InIm:- Integral imaging.

Download Full Size | PDF

In total, for training we have 180 videos, with 60 videos for each gesture. For the case of continuous gesture videos, we have considered three different combinations of two gestures, thereby having 90 videos in total. The continuous gesture videos are split into validation and testing datasets. The dataset from one participant has been used for validation (15 videos), while the data collected from the other five participants are used for testing (75 videos). In order to improve the performance of the gesture classification using the neural network model, we adopted data augmentation such as blurring and inversion. Thus, following data augmentation, we have 540 videos in total for training the neural network, with 180 videos for each gesture.

3.1 Performance analysis of gesture detection

For evaluating the performance of the detector, we have used the log miss rate – false positive per image curve. It is obtained by plotting the miss rate $\left( {MR = \; \frac{{False\; negative}}{{True\; positive + False\; negative}}} \right)$ vs the number of false positives per image (FPPI) by varying the threshold [36]. In the case of log miss rate-false positive per image curve, the lower the curve the better the detection performance. In addition, we have used the average precision and average log miss rate for comparing the performance of different modalities used in the manuscript. From Fig. 7, we can see that the 3D InIm based approach provides a significantly higher performance as compared to 2D imaging and RGB-D sensor-based approaches for detection. In case of Kinect depth data, since the network input layer has three channels, we stacked the depth image to form a three-channel data before providing as input to the YOLOv2 detector. In principle, Kinect RGB and 2D imaging (a single 2D elemental image from an array of cameras) appear to be similar modality. However, since our aim is to compare two popular 3D imaging modalities for continuous gesture recognition and since the Kinect provides the RGB as well as the depth information, we have included the results for Kinect RGB along with the other modalities for performance comparison. In cases where we got the same validation accuracies for different set of parameters or if the results on validation dataset was too poor, we reported the best test accuracy achieved for 2D imaging, Kinect RGB and depth sensor for comparison with the proposed 3D InIm-based approach.

Fig. 7. Performance comparison (miss rate vs. the number of false positives per image (FPPI)) of the detector for various imaging modalities: 3D Integral imaging (blue), 2D imaging (red), Kinect RGB (yellow) and Kinect depth (violet). Lowest value curves have the best detection performance. InIm:- Integral imaging.

Download Full Size | PDF

3.2 Performance analysis of spatio-temporal gesture segmentation

In the case of spatio-temporal segmentation, the Savitzky-Golay smoothing filter is characterized by two parameters: the frame length and the polynomial order. To choose these parameters, we performed a 2D grid search on the validation set and chose the optimal frame length and polynomial order as the ones which maximize the signal to noise ratio (SNR) between the ground truth label and the smoothed confidence scores. The frame length was set as $(N)$ = 23 with a polynomial order of 1 for 3D InIm. For 2D imaging a frame length of 63 and polynomial order of 2 were chosen. For the Kinect RGB the optimal frame length and polynomial order were 87 and 1 respectively. Lastly for the Kinect depth sensor these parameters were determined as 41 and 2, respectively. Figure 8 illustrates the advantage of using the “smoothed” score rather than the raw confidence score from the detector for temporal segmentation wherein the smoothing removes much of the noise in the detection scores to detect starting and ending points of gestures in the continuous gesture sequence, enabling better temporal segmentation.

Fig. 8. Sample frames from a continuous gesture sequence with corresponding a) the ground truth confidence scores (green), (b) raw confidence scores from the detector (red), and c) “smoothed” confidence score (blue).

Download Full Size | PDF

3.3 Gesture classification performance analysis

For comparing the final classification results of the proposed approach with different modalities, we have computed the Receiver Operating Characteristic (ROC) curves as shown in Fig. 9. In addition to accuracy, the area under the ROC curve (AUC) has been widely used metric to investigate different classifiers’ performance [37,38]. We have presented three ROC curves for showing the performance comparison of different modalities for three different cases: a) gesture 1 (True class) vs gesture 2 and gesture 3 (False class), b) gesture 2 (True class) vs gesture 1 and gesture 3 (False class), and c) gesture 3 (True class) vs gesture 1 and gesture 2 (False class). For all the three cases considered in this manuscript, the 3D InIm based approach provides a considerably higher performance as compared to the other imaging modalities. The final gesture classification results are summarized in Table 1. In addition to accuracy and AUC, we have also included Mathew's correlation coefficient (MCC) as a performance metric due to its advantages over accuracy and F1 score [39,40]. The MCC computes the correlation between the ground truth and the classification predictions. Its value ranges from -1 to 1, where -1 indicates complete disagreement and 1 indicates perfect classification predictions.

Table 1. Table showing the comparison of the proposed deep learning continuous gesture recognition approach with different imaging modalities. CNN – Convolutional Neural Network, AUC – Area under the ROC curve, MCC- Mathews Correlation Coefficient

View Table

Thus, from Fig. 9 and Table 1, we can see that the proposed approach using 3D InIm achieves a considerable improvement in performance as compared to other modalities for the experimental conditions considered in this manuscript. For multi-class case, the 2D imaging, Kinect RGB, and the depth sensors perform as poorly as a random classifier. In Table 1, for the binary classification (for example, gesture 1 (True class) vs gesture 2 and gesture 3 (False class)), the 2D imaging, Kinect RGB, and the depth sensors have accuracy that is slightly higher than a random classifier, with the AUC and MCC values close to a random or slightly better than a random classifier in most cases considered. The computation time (in seconds) required for the proposed approach is 51.82, 24.20, 26.47, and 0.45 respectively for the 3D InIm reconstruction, gesture detection, temporal segmentation and gesture classification. The computation time has been calculated for a 358-frame test video (thus 0.29 seconds/frame). However, the computation time for 3D InIm reconstruction can be further reduced by using GPU based stream-processing [41].

Fig. 9. Receiver operating characteristics (ROC) curves for gesture classification experiments. (a) gesture 1 – True class; gesture 2 and gesture 3 – False class. (b) gesture 2 – True class; gesture 1 and gesture 3 – False class. (c) gesture 3 – True class; gesture 1 and gesture 2 – False class. InIm:- Integral imaging.

Download Full Size | PDF

4. Conclusion

In summary, we have presented a continuous gesture recognition system based on 3D integral imaging and deep neural networks and compared the performance of the proposed method with 2D imaging and RGB-D sensor under partial occlusion. The results obtained from our experiments suggest that using the 3D integral imaging-based approach substantially improves the performance as compared to other imaging modalities under degraded environments such as partial occlusion. Future experiments may consider deep learning-based 3D InIm systems for more challenging conditions such as low illumination with improved speed.

Appendix A1: Long Short-Term Memory (LSTM) hidden vector computation

Let the input data constitute T time steps, $\textrm{x} = [{\textrm{x}_1},{\textrm{x}_2},{\textrm{x}_3}, \ldots ,{\textrm{x}_\textrm{T}}],$ the hidden vector $\textrm{h} = [{\textrm{h}_1},{\textrm{h}_2},{\textrm{h}_3}, \ldots ,{\textrm{h}_\textrm{T}}]$ and $\mathrm{\sigma }(\textrm{x} )= 1/({1 + {\textrm{e}^{ - \textrm{x}}}} )$. Let, ${\textrm{W}_{\textrm{mn}}}$, ${\textrm{b}_\textrm{m}}, {\textrm{m}},\varepsilon \{ {\textrm{x,h}}\}$ and ${\textrm{n}} \epsilon \{ {\textrm{o,c,f,i}}\}$ denote the corresponding weight matrices and bias terms, respectively, a standard LSTM network computes the hidden vector using the following relationships [34,35],

(3)$${\textrm{i}_\textrm{t}} = \mathrm{\sigma }({{\textrm{W}_{\textrm{xi}}}{\textrm{x}_\textrm{t}} + {\textrm{W}_{\textrm{hi}}}{\textrm{h}_{\textrm{t} - 1}} + {\textrm{b}_\textrm{i}}})$$

(4)$${\textrm{f}_\textrm{t}} = \mathrm{\sigma }({{\textrm{W}_{\textrm{xf}}}{\textrm{x}_\textrm{t}} + {\textrm{W}_{\textrm{hf}}}{\textrm{h}_{\textrm{t} - 1}} + {\textrm{b}_\textrm{f}}} )$$

(5)$${\textrm{c}_\textrm{t}} = {\textrm{f}_\textrm{t}}{\textrm{c}_{\textrm{t} - 1}} + {\textrm{i}_\textrm{t}}\textrm{tanh}({{\textrm{W}_{\textrm{xc}}}{\textrm{x}_\textrm{t}} + {\textrm{W}_{\textrm{hc}}}{\textrm{h}_{\textrm{t} - 1}} + {\textrm{b}_\textrm{c}}} )$$

(6)$${\textrm{o}_\textrm{t}} = \mathrm{\sigma }({{\textrm{W}_{\textrm{xo}}}{\textrm{x}_\textrm{t}} + {\textrm{W}_{\textrm{ho}}}{\textrm{h}_{\textrm{t} - 1}} + {\textrm{b}_\textrm{o}}} )$$

(7)$${\textrm{h}_\textrm{t}} = {\textrm{o}_\textrm{t}}\textrm{tanh}({{\textrm{c}_\textrm{t}}} )$$

where $ \textrm{t} = ({1,2,3, \ldots ,\textrm{T}} )$. Here, i, f, o, are the input, forget and output gates respectively and c represents the cell state vector.

Appendix A2: Network layersused for the detector and the classifier

Network layers used in feature extraction module for YOLOv2 detection is shown in Fig. 10.

Fig. 10. Network layers used in the feature extraction module for YOLOv2 used for gesture detection (Fig. 3). The layers are based on pretrained ResNet50 [42], trained on ImageNet dataset [30].

Download Full Size | PDF

The network layers for the CNN-BiLSTM classifier used for gesture classification is shown in Fig. 11.

Fig. 11. (a) Network layers for the CNN-BiLSTM based classifier used for gesture classification (see Fig. 4) (b) Layers of the inception module.

Download Full Size | PDF

Appendix A3: Training loss curves for the detector and the classifier

The training loss for the YOLOv2 detector for different modalities is shown in Fig. 12.

Fig. 12. Training loss (for YOLOv2) for (a) 3D InIm, (b) 2D Imaging (c) Kinect RGB and (d) Kinect depth. Here, the x-axis represents the number of iterations and y-axis represents the training loss for each iteration.

Download Full Size | PDF

The training loss curves for the CNN-BiLSTM classifier for different modalities is shown in Fig. 13.

Fig. 13. Training loss for the CNN-BiLSTM classifier (a) 3D InIm, (b) 2D Imaging (c) Kinect RGB and (d) Kinect depth. Here, the x-axis represents the number of iterations and y-axis represents the training loss for each iteration

Download Full Size | PDF

Funding

Air Force Office of Scientific Research (FA9550-18-1-0338, FA9550-21-1-0333); Office of Naval Research (N000141712405, N00014-17-1-2561, N000142012690).

Acknowledgments

T. O'Connor acknowledges support via the GAANN fellowship through the Department of Education. We would like to thank Aleksandr Samegulin for his help during initial Kinect data collection process. Any opinions, findings, and conclusions, or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the U.S. Department of Defense.

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

References

1. S. Mitra and T. Acharya, “Gesture Recognition: A Survey,” IEEE Trans. Syst., Man, Cybern. C 37(3), 311–324 (2007). [CrossRef]

2. B. Javidi, F. Pla, J. M. Sotoca, X. Shen, P. Latorre-Carmona, M. Martínez-Corral, R. Fernández-Beltrán, and G. Krishnan, “Fundamentals of automated human gesture recognition using 3D integral imaging: a tutorial,” Adv. Opt. Photonics 12(4), 1237–1299 (2020). [CrossRef]

3. G. Krishnan, R. Joshi, T. O’Connor, F. Pla, and B. Javidi, “Human gesture recognition under degraded environments using 3D-integral imaging and deep learning,” Opt. Express 28(13), 19711–19725 (2020). [CrossRef]

4. S. Escalera, V. Athitsos, and I. Guyon, “Challenges in multimodal gesture recognition,” J. Mach. Learn. Res. 17(72), 1–54 (2016).

5. Y. Song, D. Demirdjian, and R. Davis, “Continuous Body and Hand Gesture Recognition for Natural Human-Computer Interaction,” ACM Trans. Interact. Intell. Syst. 2(1), 1–28 (2012). [CrossRef]

6. H. Li and M. Greenspan, “Model-based segmentation and recognition of dynamic gestures in continuous video streams,” Pattern Recognition 44(8), 1614–1628 (2011). [CrossRef]

7. M. Elmezain, A. Al-Hamadi, J. Appenrodt, and B. Michaelis, “A Hidden Markov Model-based continuous gesture recognition system for hand motion trajectory,” in 19th International Conference on Pattern Recognition (2008), pp. 1–4. [CrossRef]

8. H. Li and M. Greenspan, “Segmentation and Recognition of Continuous Gestures,” in IEEE International Conference on Image Processing (2007), 1, pp. 365–368.

9. Z. Liu, X. Chai, Z. Liu, and X. Chen, “Continuous Gesture Recognition with Hand-Oriented Spatiotemporal Feature,” in IEEE International Conference on Computer Vision Workshops (ICCVW) (2017), pp. 3056–3064.

10. X. Shen, H. Kim, K. Satoru, A. Markman, and B. Javidi, “Spatial-temporal human gesture recognition under degraded conditions using three-dimensional integral imaging,” Opt. Express 26(11), 13938–13951 (2018). [CrossRef]

11. V. J. Traver, P. Latorre-Carmona, E. Salvador-Balaguer, F. Pla, and B. Javidi, “Three-Dimensional Integral Imaging for Gesture Recognition Under Occlusions,” IEEE Signal Process. Lett. 24(2), 171–175 (2017). [CrossRef]

12. J. Redmon and A. Farhadi, “YOLO 9000: Better, Faster, Stronger,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017), pp. 6517–6525.

13. M. Martínez-Corral and B. Javidi, “Fundamentals of 3D imaging and displays: a tutorial on integral imaging, light-field, and plenoptic systems,” Adv. Opt. Photonics 10(3), 512–566 (2018). [CrossRef]

14. S.-H. Hong, J.-S. Jang, and B. Javidi, “Three-dimensional volumetric object reconstruction using computational integral imaging,” Opt. Express 12(3), 483–491 (2004). [CrossRef]

15. N. Davies, M. McCormick, and L. Yang, “Three-dimensional imaging systems: a new development,” Appl. Opt. 27(21), 4520–4528 (1988). [CrossRef]

16. C. B. Burckhardt, “Optimum Parameters and Resolution Limitation of Integral Photography,” J. Opt. Soc. Am. 58(1), 71–76 (1968). [CrossRef]

17. B. Javidi, R. Ponce-Díaz, and S.-H. Hong, “Three-dimensional recognition of occluded objects by using computational integral imaging,” Opt. Lett. 31(8), 1106–1108 (2006). [CrossRef]

18. G. Lippmann, “Epreuves reversibles donnant la sensation du relief,” J. Phys. 7(1), 821–825 (1908). [CrossRef]

19. A. Stern, D. Aloni, and B. Javidi, “Experiments With Three-Dimensional Integral Imaging Under Low Light Levels,” IEEE Photonics J. 4(4), 1188–1195 (2012). [CrossRef]

20. A. Markman, X. Shen, and B. Javidi, “Three-dimensional object visualization and detection in low light illumination using integral imaging,” Opt. Lett. 42(16), 3068–3071 (2017). [CrossRef]

21. M. Cho and B. Javidi, “Peplography—a passive 3D photon counting imaging through scattering media,” Opt. Lett. 41(22), 5401–5404 (2016). [CrossRef]

22. I. Moon and B. Javidi, “Three-dimensional visualization of objects in scattering medium by use of computational integral imaging,” Opt. Express 16(17), 13080–13089 (2008). [CrossRef]

23. B. Javidi, A. Carnicer, J. Arai, T. Fujii, H. Hua, H. Liao, M. Martínez-corral, F. Pla, A. Stern, L. Waller, Q. H. Wang, G. Wetzstein, M. Yamaguchi, and H. Yamamoto, “Roadmap on 3D integral imaging: sensing, processing, and display,” Opt. Express 28(22), 32266–32293 (2020). [CrossRef]

24. S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks,” IEEE Trans. Pattern Anal. Mach. Intell. 39(6), 1137–1149 (2017). [CrossRef]

25. Z.-Q. Zhao, P. Zheng, S.-T. Xu, and X. Wu, “Object Detection With Deep Learning: A Review,” IEEE Trans. Neural Netw. Learning Syst. 30(11), 3212–3232 (2019). [CrossRef]

26. J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You Only Look Once: Unified, Real-Time Object Detection,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016), pp. 779–788.

27. W. Ali, S. Abdelkarim, M. Zidan, M. Zahran, and A. El Sallab, “YOLO3D: End-to-End Real-Time 3D Oriented Object Bounding Box Detection from LiDAR Point Cloud,” in Computer Vision – ECCV 2018 Workshops, L. Leal-Taixé and S. Roth, eds. (Springer International Publishing, 2019), pp. 716–728.

28. R. W. Schafer, “What Is a Savitzky-Golay Filter? [Lecture Notes],” IEEE Signal Process. Mag. 28(4), 111–117 (2011). [CrossRef]

29. S. J. Orfanidis, Introduction to Signal Processing (Prentice-Hall, Inc., 1995).

30. J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in IEEE Conference on Computer Vision and Pattern Recognition (2009), pp. 248–255.

31. C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2015), pp. 1–9.

32. P. Baldi, S. Brunak, P. Frasconi, G. Soda, and G. Pollastri, “Exploiting the past and the future in protein secondary structure prediction,” Bioinformatics 15(11), 937–946 (1999). [CrossRef]

33. M. Schuster and K. K. Paliwal, “Bidirectional recurrent neural networks,” IEEE Trans. Signal Process. 45(11), 2673–2681 (1997). [CrossRef]

34. S. Hochreiter and J. Schmidhuber, “Long Short-Term Memory,” Neural Computation 9(8), 1735–1780 (1997). [CrossRef]

35. J. Y.-H. Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici, “Beyond short snippets: Deep networks for video classification,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2015), pp. 4694–4702.

36. N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05) (2005), 1, pp. 886–893.

37. E. Keedwell, “An analysis of the area under the ROC curve and its use as a metric for comparing clinical scorecards,” in IEEE International Conference on Bioinformatics and Biomedicine (BIBM) (2014), pp. 24–29.

38. S. Alam, O. Odejide, O. Olabiyi, and A. Annamalai, “Further results on area under the ROC curve of energy detectors over generalized fading channels,” in 34th IEEE Sarnoff Symposium (2011), pp. 1–6.

39. D. Chicco and G. Jurman, “The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation,” BMC Genomics 21(1), 6 (2020). [CrossRef]

40. J. Gorodkin, “Comparing two K-category assignments by a K-category correlation coefficient,” Comput. Biol. Chem. 28(5-6), 367–374 (2004). [CrossRef]

41. F. Yi, I. Moon, J.-A. Lee, and B. Javidi, “Fast 3D Computational Integral Imaging Using Graphics Processing Unit,” J. Disp. Technol. 8(12), 714–722 (2012). [CrossRef]

42. K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016), pp. 770–778.

Method	Condition	Accuracy	AUC	MCC
Proposed 3D integral imaging-based Continuous gesture recognition approach.	Gesture 1 True class	90.97%	0.9773	0.7985
	Gesture 2 True class	91.67%	0.9701	0.8183
	Gesture 3 True class	93.75%	0.9838	0.8587
	Multi-class	88.19%	-	0.8252
2D imaging-based continuous gesture recognition approach	Gesture 1 True class	67.68%	0.6096	0.0776
	Gesture 2 True class	70.71%	0.5155	0.1490
	Gesture 3 True class	40.40%	0.5423	0.0264
	Multi-class	39.39%	-	0.0843
Kinect RGB sensor-based continuous gesture recognition approach.	Gesture 1 true class	74.29%	0.8046	0.4348
	Gesture 2 True class	70.71%	0.4545	0.1210
	Gesture 3 True class	40.40%	0.4737	0.0789
	Multi-class	45.71%	-	0.2116
Kinect depth sensor-based continuous gesture recognition approach.	Gesture 1 True class	59.04%	0.6986	0.1275
	Gesture 2 True class	69.88%	0.3404	0.0704
	Gesture 3 True class	40.96%	0.4786	0.1382
	Multi-class	34.94%	-	0.1121

Spatio-temporal continuous gesture recognition under degraded environments: performance comparison between 3D integral imaging (InIm) and RGB-D sensors

Abstract

1. Introduction

2. Methodology

2.1 3D Integral Imaging based computational reconstruction

2.2 Gesture detection using YoloV2 neural network model

2.3 Spatio-temporal continuous gesture segmentation

2.4 Gesture classification using CNN-BiLSTM network

3. Experimental results and discussions

3.1 Performance analysis of gesture detection

3.2 Performance analysis of spatio-temporal gesture segmentation

3.3 Gesture classification performance analysis

4. Conclusion

Appendix A1: Long Short-Term Memory (LSTM) hidden vector computation

Appendix A2: Network layersused for the detector and the classifier

Appendix A3: Training loss curves for the detector and the classifier

Funding

Acknowledgments

Disclosures

Data availability

References

Data availability

Cited By

Figures (13)

Tables (1)

Equations (7)

Optics Express