Illumination invariant recognition and 3D reconstruction of faces using desktop optics

Open Access

Abstract

We propose illumination invariant face recognition and 3D face reconstruction using desktop optics. The computer screen is used as a programmable extended light source to illuminate the face from different directions and acquire images. Features are extracted from these images and projected to multiple linear subspaces in an effort to preserve unique features rather than the most varying ones. Experiments were performed using our database of 4347 images (106 subjects) as well as the extended Yale B and CMU-PIE databases, and better results were achieved compared to the existing state of the art. We also propose an efficient algorithm for reconstructing 3D face models from three images under arbitrary illumination. The subspace coefficients of training faces are used as input patterns to train multiple Support Vector Machines (SVMs), where the output labels are the subspace parameters of ground truth 3D face models. Support Vector Regression is used to learn multiple functions that map the input coefficients to the parameters of the 3D face. During testing, three images of an unknown/novel face under arbitrary illumination are used to estimate its 3D model. Quantitative results are presented using our database of 106 subjects and qualitative results are presented on the Yale B database.

©2011 Optical Society of America

1. Introduction

Face recognition under varying illumination is a challenging problem because it dramatically changes the facial appearance. Expressions and pose variations can introduce further challenges. However, they are less problematic in many applications where the subject is cooperative. Face recognition has applications in security, access control, surveillance and human computer interaction. Zhao et al. [1] provide a detailed survey of face recognition algorithms and categorize them into holistic techniques which match global features of the face [2, 3], feature-based techniques which match local features [4] and hybrid techniques which use both.

Human visual perception has inspired many researchers to use video sequences for constructing a joint representation of the face in the spatio-temporal domain [1]. Arandjelovic and Cipolla [5] proposed shape-illumination manifolds: they first find the best match to a video sequence in terms of pose and then re-illuminate it based on the manifold. Appearance manifolds, under changing pose, were also used by Lee and Kriegman [6] for face recognition. Liu et al. [7] performed online learning for multiple image based face recognition without using a pre-trained model. Tangelder and Schouten [8] used a sparse representation of multiple still images for face recognition. These techniques rely on long term changes, implying longer acquisition times. They assume that the images contain non-redundant information either due to changes in pose or the motion of facial features due to expressions.

Multiple images of a face acquired instantly from a fixed viewpoint are mostly redundant unless the illumination is varied. We propose a novel setup where the computer screen is used as a programmable extended light source to vary the illumination (see Fig. 1). The Contourlet coefficients [9] of the acquired facial images are extracted at different scales and orientations. These coefficients are projected to separate PCA subspaces, normalized and stacked to form feature vectors. The features are projected to a common linear subspace for dimensionality reduction and used for illumination invariant face recognition. We also show that the same setup can be used to reconstruct 3D faces. Face models from 3D scanners are used to train multiple SVMs [10]. During testing, the SVMs take three unseen images per face under arbitrary illumination to estimate their 3D models which are then quantitatively compared to ground truth.

Fig. 1 Multiple images of a face are acquired while illumination is varied by scanning a white stripe on a computer screen.

This paper is organized as follows. Section 2.1 surveys illumination invariant face recognition and highlights the advantages of desktop optics for this purpose. It also differentiates our feature extraction approach from other Contourlet-based techniques. Section 2.2 surveys 3D face reconstruction techniques that use multiple images or the computer screen as illuminant. Section 3 describes our data acquisition setup. Section 4 gives details of the proposed feature extraction methodology and the classification criterion. Section 5 describes the 3D reconstruction algorithm. Sections 6 and 7 give detailed results and analysis of illumination invariant face recognition and 3D face reconstruction, respectively. Section 8 concludes our findings.

2. Literature review

2.1. Illumination invariant face recognition

Belhumeur and Kriegman used multiple images under arbitrary point source illuminations to construct the illumination cone and 3D shape of objects [11]. Georghiades et al. [12] extended the idea to construct 3D faces and their corresponding albedos, which were subsequently used to synthesize a large number (80–120) of facial images under novel illuminations. The synthetic images were used to estimate the illumination cone of the face for illumination invariant recognition. Hallinan [13] and Basri and Jacobs [14] separately showed that the illumination cone can be approximated by a low dimensional subspace (five to nine dimensions). There exist nine universal virtual lighting conditions such that the images under these illuminations are sufficient to approximate the illumination cone [15]. Lee et al. [15] avoided the calculation of the 3D face and its albedo (as in [12, 14]) by constructing a linear subspace from nine physical lighting directions for illumination invariant face recognition. However, some of the suggested light source directions [15] are at angles greater than 100 degrees, which are difficult to achieve due to space limitations.

Point light sources must have high intensity, which increases specularities; these are nonlinear and difficult to handle. Schechner et al. [16] showed that images under multiplexed illumination from a collection of lower intensity point sources can offer a better signal to noise ratio. This alleviates the need for high intensity, but the problem of specularities still remains. Lee et al. [15] suggest that the superposition of images under different point sources, or images with a strong ambient component, are more effective for recognition. These findings naturally hint towards studying face recognition under extended light sources, which is the focus of this paper. We investigate the construction of a subspace representation of the face for illumination invariant recognition using extended light sources. An extended light source can be placed close to the face (without the risk of specularities), alleviating the need for large space and high brightness.

The Contourlet transform has been used before for face recognition [17, 18]. However, these techniques use a global PCA space for projection and do not explicitly model illumination or use desktop optics for doing so. Besides using desktop optics and illumination modeling, our approach differs because it builds separate PCA spaces for each scale and orientation. The Contourlet coefficients at each scale and orientation are projected to their respective PCA subspaces and the variation along each dimension is normalized. This process preserves unique features, rather than only the most varying features, at every scale and orientation. We achieve better results, on the same datasets, compared to [18] and state-of-the-art techniques (Section 6.4).

2.2. 3D face modeling using images/the computer screen as illuminant

Our recognition algorithm was initially proposed in [19]. Here, we extend the technique to 3D face modeling as well, which has many applications in computer graphics and virtual reality. Accurate 3D models can be acquired with laser scanning or structured light techniques. However, they require additional hardware and the acquisition speed is slow. The projection of lasers may be inconvenient and socially unacceptable to users. Active devices also fail to acquire dark regions (e.g. eyebrows) and changes in ambient illumination can still affect their accuracy [20].

3D modeling from images is attractive because images are easy to acquire. Passive stereo based 3D face modeling requires two cameras and cannot provide dense 3D models due to the correspondence problem. Scharstein et al. [21] present a taxonomy of stereo correspondence algorithms and discuss their limitations. Blanz and Vetter [22] estimate the 3D face model from a single image using a morphable 3D model. They learn a PCA model from a large number of textured 3D facial scans. The model is morphed until it produces an image similar to the input. The final parameters of the morphed model basically represent the face, independent of pose and illumination, and can be used for invariant recognition. This technique is computationally expensive and requires manual identification of many control points on the face.

The 3D shape of a convex object with Lambertian surface viewed from the same pose under unknown illumination (distant point light sources) can be recovered from three or more images [11]. However, the 3D shape can be determined only up to a three parameter Generalized Bas-Relief (GBR) transform i.e. with ambiguous scale, slant and tilt. For human faces, assumptions such as facial symmetry and distance between facial landmarks have been used to estimate the ambiguous parameters [12]. In this paper, we take a model-based approach to 3D face reconstruction which avoids the GBR ambiguity. In contrast to previous techniques, we use an extended light source (the computer screen) and an empirical approach.

Schindler [23] and Funk and Yang [24] used the computer screen for photometric stereo. In [24], the screen was calibrated by measuring the irradiance at a scene point from a display pixel. The calibration model was a function of azimuth and elevation angles. Recently, Clark [25] showed that an equivalent point light source can be obtained for distinct illumination patterns, allowing the application of standard photometric stereo. Clark performed photometric calibration of the LCD screen as well as the camera to account for the non-linear relationship of the true and displayed/measured RGB values.

The above techniques do not provide quantitative comparison of the reconstructed 3D shapes with ground truth. We report quantitative results (Section 6) to demonstrate the accuracy of our approach which is also more efficient and does not require cumbersome calibration procedures as in [24, 25]. Our hardware can be accommodated on a desktop whereas distant light sources require a dedicated room [12, 26]. Our extended light source avoids specularities and reduces cast shadows which are known sources of errors in photometric stereo.

3. Data acquisition

Figure 1 shows our image acquisition setup. Many computer screens have inbuilt webcams; therefore, no additional hardware is required. For greater variability in the incident light angle and a good signal to noise ratio, the subject must not be far from the screen. Image capture is automatically initiated using face detection [27] or it can be manually initiated. A white horizontal stripe scans from the top to the bottom of the screen, followed by a vertical stripe which scans from left to right. The stripe was 200 pixels wide; 8 images were captured during the vertical scan and 15 during the horizontal scan. A final image was captured in ambient light for subtracting from all other images if required. All images are normalized so that a straight horizontal line passes through the centers of their eyes. The scale is also normalized based on the manually identified centers of the eyes and lips. The manual identification of eyes and lips can be replaced with automatic detection on the basis of all 23 images. See Fig. 2 for normalized images. A mask was used to remove the lower corners of the image. We imaged 106 subjects over a period of eight months. Of these, 83 appeared in two different sessions with an average gap of 60 days.
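The acquisition loop is simple to prototype. Below is a minimal sketch, assuming OpenCV with the built-in webcam and a 1920×1080 display; the window name, frame spacing and the 100 ms settling delay are illustrative assumptions, not the authors' implementation.

```python
import cv2
import numpy as np

SCREEN_W, SCREEN_H = 1920, 1080   # assumed display resolution
STRIPE = 200                      # stripe width in pixels, as in the text

def stripe_frame(offset, vertical):
    """Full-screen black frame with a 200-pixel white stripe at `offset`."""
    frame = np.zeros((SCREEN_H, SCREEN_W), dtype=np.uint8)
    if vertical:                              # vertical stripe, scans left to right
        frame[:, offset:offset + STRIPE] = 255
    else:                                     # horizontal stripe, scans top to bottom
        frame[offset:offset + STRIPE, :] = 255
    return frame

cam = cv2.VideoCapture(0)                     # built-in webcam
cv2.namedWindow("illum", cv2.WINDOW_NORMAL)
cv2.setWindowProperty("illum", cv2.WND_PROP_FULLSCREEN, cv2.WINDOW_FULLSCREEN)

images = []
# 8 captures while a horizontal stripe scans top to bottom ("vertical scan"),
# then 15 captures while a vertical stripe scans left to right ("horizontal scan").
for vertical, n_frames, extent in [(False, 8, SCREEN_H), (True, 15, SCREEN_W)]:
    for offset in np.linspace(0, extent - STRIPE, n_frames).astype(int):
        cv2.imshow("illum", stripe_frame(offset, vertical))
        cv2.waitKey(100)                      # let the display settle before capture
        ok, img = cam.read()
        if ok:
            images.append(cv2.cvtColor(img, cv2.COLOR_BGR2GRAY))

# Final ambient-light image (blank screen) for optional subtraction.
cv2.imshow("illum", np.zeros((SCREEN_H, SCREEN_W), dtype=np.uint8))
cv2.waitKey(100)
ok, ambient = cam.read()
cam.release()
cv2.destroyAllWindows()
```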

Fig. 2 Sample faces after preprocessing.

4. Subspace feature representation and classification

We use the Contourlet transform [9] (an extension of wavelets) for extracting features to construct subspaces. Gabor wavelets have been well studied for face recognition [4] and a survey can be found in [28]. Wavelets provide a time-frequency representation of signals and are good at analyzing point (or zero dimensional) discontinuities. Wavelets are suitable for analyzing one dimensional signals. However, images are inherently two dimensional and can have one dimensional discontinuities such as curves. These discontinuities can be captured more efficiently by Contourlets [9]. The Contourlet transform performs multi-directional decomposition of images at multiple scales, allowing a different number of directions at each scale [9].

Let $a_i^{sk}$ represent the vector of Contourlet coefficients of the $i$th image (where $i = 1\ldots 23$) at scale $s$ and orientation $k$. The Contourlet transform has 33% inherent redundancy [9]. Moreover, the Contourlet coefficients (at the same scale and orientation) of many faces can be approximated by a much smaller linear subspace. Therefore, the Contourlet coefficients of all training images (at the same scale and orientation) are projected separately to PCA subspaces.

Let $A^{sk} = [a_{ij}^{sk}]$ (where $i \in \{1,2,\ldots,23\}$ and $j = 1,2,\ldots,G$) represent the matrix of Contourlet coefficients of the $N$ training images (under different illuminations) of the $G$ subjects in the training data at the same scale $s$ and orientation $k$. Note that only a subset of the 23 images under different illuminations is used for training. Each column of $A^{sk}$ contains the Contourlet coefficients of one image. The mean of the matrix is given by

$$\mu^{sk} = \frac{1}{N\times G}\sum_{n=1}^{N\times G} A_n^{sk},\tag{1}$$

and the covariance matrix by

$$C^{sk} = \frac{1}{N\times G}\sum_{n=1}^{N\times G}\left(A_n^{sk}-\mu^{sk}\right)\left(A_n^{sk}-\mu^{sk}\right)^T.\tag{2}$$

The eigenvectors of $C^{sk}$ are calculated by Singular Value Decomposition

$$U^{sk} S^{sk} \left(V^{sk}\right)^T = C^{sk},\tag{3}$$

where the matrix $U^{sk}$ contains the eigenvectors sorted according to the decreasing order of eigenvalues and the diagonal matrix $S^{sk}$ contains the respective eigenvalues. Let $\lambda_n$ (where $n = 1,2,\ldots,N\times G$) represent the eigenvalues in decreasing order. We select the subspace dimension (i.e. the number of eigenvectors) so as to retain 90% of the energy and project the Contourlet coefficients to this subspace. If $U_L^{sk}$ represents the first $L$ eigenvectors of $U^{sk}$, then the subspace Contourlet coefficients at scale $s$ and orientation $k$ are given by

$$B^{sk} = \left(U_L^{sk}\right)^T\left(A^{sk} - \mu^{sk} p\right),\tag{4}$$

where $p$ is a row vector of all 1's such that $\mu^{sk}p$ has the same dimensions as $A^{sk}$. Note that $U_L^{sk}$ represents the subspace for Contourlet coefficients at scale $s$ and orientation $k$. Similar subspaces are calculated for different scales and orientations using the training data and, each time, the subspace dimension is chosen so as to retain 90% of the energy. In our experiments, we considered three scales and a total of 15 orientations along with the low pass sub-band image. Figure 3 shows samples of a sub-band image and Contourlet coefficients at two scales and seven orientations.
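For concreteness, the per-subband subspace construction of Eqs. (1)–(4) can be sketched in a few lines of NumPy, assuming the Contourlet coefficients have already been extracted (e.g. with the Contourlet Toolbox) and vectorized into the columns of the input matrix; the function and variable names are ours, not the paper's.

```python
import numpy as np

def subband_subspace(A_sk, energy=0.90):
    """PCA subspace of Eqs. (1)-(4) for one scale/orientation.

    A_sk : (d, N*G) array, one vectorized set of Contourlet coefficients
           per column (one column per training image).
    Returns the projected coefficients B_sk, the basis U_L, the mean mu
    and the retained eigenvalues (used later for variance normalization).
    """
    mu = A_sk.mean(axis=1, keepdims=True)                          # Eq. (1)
    A0 = A_sk - mu                                                 # centered coefficients
    C = (A0 @ A0.T) / A_sk.shape[1]                                # Eq. (2)
    U, S, _ = np.linalg.svd(C)                                     # Eq. (3), S in decreasing order
    L = int(np.searchsorted(np.cumsum(S) / S.sum(), energy)) + 1   # keep 90% of the energy
    U_L, lam = U[:, :L], S[:L]
    B_sk = U_L.T @ A0                                              # Eq. (4)
    return B_sk, U_L, mu, lam
```

The returned eigenvalues are what the next paragraph divides by (their square roots) to equalize the variance along each retained dimension.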

Fig. 3 Contourlet coefficients of a sample face.

The subspace Contourlet coefficients were normalized so that the variance along each of the $L$ dimensions becomes equal. This is done by dividing the subspace coefficients by the square root of the respective eigenvalues. The normalized subspace Contourlet coefficients at three scales and 15 orientations of each image are stacked to form a matrix of feature vectors $B$, where each column is a feature vector of the concatenated subspace Contourlet coefficients of an image. The concatenated vectors may still have some redundancy; therefore, these features are once again projected to a linear subspace. However, this time the mean need not be subtracted as the features are already centered at the origin. Since the feature dimension is usually large compared to the size of the training data, $BB^T$ is very large. Moreover, at most $N\times G - 1$ orthogonal dimensions (eigenvectors and eigenvalues) can be calculated for a training data of size $N\times G$; the $(N\times G)$th eigenvalue is always zero. Therefore, we calculate the covariance matrix $C = B^T B$ instead and find the $N\times G - 1$ dimensional subspace as follows:

$$U' S \left(V'\right)^T = C,\tag{5}$$
$$U = B U' \big/ \sqrt{\operatorname{diag}(S)}.\tag{6}$$

In Eq. (6), each dimension (i.e. column of $BU'$) is divided by the square root of the corresponding eigenvalue so that the eigenvectors in $U$ (i.e. its columns) are of unit magnitude. The last column of $BU'$ is ignored to avoid division by zero. Thus $U$ defines an $N\times G - 1$ dimensional linear subspace. The feature vectors are projected to this subspace and used for classification:

$$F = U^T B.\tag{7}$$
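A corresponding NumPy sketch of this second projection stage (Eqs. (5)–(7)), under the assumption that B already stacks the variance-normalized per-subband coefficients column-wise; it uses the smaller Gram matrix $B^TB$ exactly as described above.

```python
import numpy as np

def second_stage_subspace(B):
    """Project stacked, variance-normalized features to the (N*G - 1)-dim
    subspace of Eqs. (5)-(7) via the smaller Gram matrix B.T @ B."""
    C = B.T @ B                                # (N*G) x (N*G), Eq. (5)
    U_p, S, _ = np.linalg.svd(C)               # eigenvectors/eigenvalues of B.T @ B
    U_p, S = U_p[:, :-1], S[:-1]               # drop the last (zero) eigenvalue
    U = (B @ U_p) / np.sqrt(S)                 # Eq. (6): unit-norm eigenvectors of B @ B.T
    F = U.T @ B                                # Eq. (7): subspace features for classification
    return F, U
```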

We tested the Support Vector Machine (SVM) [10], image to subspace distance and the correlation coefficient for classification. Identification results were similar; however, the correlation coefficient gave much better verification results. Therefore, we report all results using this classifier. The correlation coefficient is defined in Eq. (8), where $t$ and $q$ are the subspace Contourlet coefficients of the target and query faces and $n$ is the subspace dimension:

$$\gamma = \frac{n\sum tq - \sum t\sum q}{\sqrt{\left(n\sum t^2 - \left(\sum t\right)^2\right)\left(n\sum q^2 - \left(\sum q\right)^2\right)}}.\tag{8}$$
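A minimal sketch of this classifier (Eq. (8)) with a nearest-neighbour decision over a gallery of subspace feature vectors; the names are illustrative.

```python
import numpy as np

def correlation_coefficient(t, q):
    """Correlation coefficient of Eq. (8) between target and query vectors."""
    n = t.size
    num = n * np.dot(t, q) - t.sum() * q.sum()
    den = np.sqrt((n * np.dot(t, t) - t.sum() ** 2) *
                  (n * np.dot(q, q) - q.sum() ** 2))
    return num / den

def identify(gallery, query):
    """Index of the gallery column (one per target face) with the highest correlation."""
    scores = [correlation_coefficient(gallery[:, j], query)
              for j in range(gallery.shape[1])]
    return int(np.argmax(scores))
```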

5. 3D face reconstruction

Although we acquired 23 images per face in one session, only three images under arbitrary illumination were used to reconstruct the 3D face model. The first three dimensions of the subspace constructed from multiple differently illuminated images of a convex object with a Lambertian surface contain the shape information. Let $I$ be a matrix whose three columns are images of the same face under different illumination. We calculate $C = I^T I$ instead of $II^T$, which would be computationally expensive given the high dimensionality of the images. Note that the mean is not subtracted from the images as in the PCA or eigenfaces method [2]. The eigenvectors are calculated as follows:

$$U' S' \left(V'\right)^T = C,\tag{9}$$
$$U_i = \frac{1}{\sqrt{\lambda_i}}\, I\, U'_i,\tag{10}$$

where $U_i$ is the $i$th eigenvector of $II^T$, $\lambda_i$ is its $i$th eigenvalue, and $U'_i$ is the $i$th column of $U'$. We use only the first eigenvector (i.e. $i = 1$) for calculating features in order to save computational time. The remaining two dimensions have a negligible effect on the accuracy because the data is normalized in the $xy$ dimensions. The first eigenvectors of all training faces were projected to a PCA subspace. Let $E$ represent the matrix whose columns are the first eigenvectors ($U_1$) of the $n$ training faces. Thus $E$ has $n$ columns and as many rows as there are pixels in a training image. Since $n$ is very small compared to the image size, we calculate the covariance matrix as follows:

$$C_e = \frac{1}{n}\left(E - \mu p\right)^T\left(E - \mu p\right),\tag{11}$$

where $\mu = \frac{1}{n}\sum_{j=1}^{n}E_j$ ($j = 1\ldots 80$), $E_j$ corresponds to the $j$th column of $E$, and $p$ is a row vector of all 1's.

$$U_e S_e \left(V_e\right)^T = C_e,\tag{12}$$
$$U_j = \frac{1}{\sqrt{\lambda_j}}\left(E - \mu p\right)U_{e_j}, \quad \text{for } j = 1\ldots(n-1),\tag{13}$$

where $U_{e_j}$ is the $j$th column of $U_e$, and $U_j$ and $\lambda_j$ are the $j$th eigenvector and eigenvalue of $\frac{1}{n}\sum_{j=1}^{n}\left(E_j-\mu\right)\left(E_j-\mu\right)^T$. At most $n-1$ eigenvectors/eigenvalues can be estimated from $n$ sets of training images. Next, $E$ is projected to the PCA subspace calculated above:

$$F = U^T E.\tag{14}$$

$F$ is a matrix of $(n-1)$-dimensional column vectors. We take the first 45 dimensions, corresponding to the largest eigenvalues, which retain 92.5% of the energy, and normalize them so that the variation along each dimension is equal, i.e. each dimension is divided by the square root of the corresponding eigenvalue $\lambda$. Finally, each vector (i.e. column) is normalized to unit length to form the input patterns $x$ for training Support Vector Regression.
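A compact sketch of Eqs. (9)–(14) that turns three vectorized images of one face into an SVR input pattern; the pretrained basis U, its eigenvalues lam and the 45-dimension cutoff come from the training stage described above, and the helper names are ours.

```python
import numpy as np

def first_eigenvector(I):
    """First eigenvector of I @ I.T from three images, Eqs. (9)-(10).
    I : (pixels, 3) array, one vectorized image per column."""
    C = I.T @ I                                  # 3 x 3, cheap to decompose
    U_p, S, _ = np.linalg.svd(C)
    return (I @ U_p[:, 0]) / np.sqrt(S[0])       # Eq. (10) with i = 1

def svr_input_pattern(I, U, lam, dims=45):
    """Project the first eigenvector to the learned PCA subspace (Eq. (14)),
    equalize the per-dimension variance and normalize to unit length."""
    e = first_eigenvector(I)
    f = U[:, :dims].T @ e                        # first 45 subspace coefficients
    f = f / np.sqrt(lam[:dims])                  # divide by sqrt of each eigenvalue
    return f / np.linalg.norm(f)                 # unit-norm input pattern x
```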

For output labels, each training identity's face is laser scanned to get an accurate 3D model. These models usually contain noise, spikes and holes. Each scan is preprocessed to remove spikes using neighbourhood constraints. A surface is then fitted to the data points using approximation, which fills holes as well as extrapolates to cover missing surfaces at the boundary [29]. A smoothing factor is used to remove noise. The pose and scale of the 3D facial scans were normalized based on the manually identified centers of the eyes and lips. Range data has absolute scale but the sampling can be different from scan to scan. To ensure unambiguous reconstruction of 3D faces, we do not alter their scale but use variable sampling rates to normalize them. We resample the faces so that they have an equal number of data points between the centers of their eyes. The number of points was chosen equal to the number of pixels between the eye centers of the 2D images. Similarly, the number of data points between the mid-point of the eyes and the center of the lips was chosen to be equal to the corresponding number of pixels in the 2D images. This led to different horizontal and vertical sampling rates of the 3D faces. The 3D faces were cropped so that the total number of data points equals the number of image pixels. The lower corners were masked out similar to the 2D faces, and the z components (depth) of the 3D face data points were vectorized and concatenated with the horizontal and vertical sampling rates to form a feature vector. Note that using the sampling rates and depth values, the complete 3D face data points (with x, y, z coordinates) can be reconstructed to the exact scale. Figure 4 shows sample 3D faces from our database after normalization.
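As an illustration of the variable-rate resampling just described, the following sketch assumes a pose-normalized depth map Z on the scanner grid and landmark distances measured both in scanner samples and in image pixels; SciPy's map_coordinates does the interpolation. This is an assumed implementation, not the authors' code.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def resample_depth(Z, eye_dist_scan, eye_dist_img,
                   eyelip_dist_scan, eyelip_dist_img, out_shape):
    """Resample depth map Z so that the number of samples between the eye
    centers (horizontally) and between the eye mid-point and lip center
    (vertically) equals the corresponding pixel counts of the 2D images.
    Returns the resampled depth map and the two sampling rates."""
    rate_x = eye_dist_scan / eye_dist_img        # scan samples per output column
    rate_y = eyelip_dist_scan / eyelip_dist_img  # scan samples per output row
    rows = np.arange(out_shape[0]) * rate_y      # row coordinates on the scan grid
    cols = np.arange(out_shape[1]) * rate_x      # column coordinates on the scan grid
    coords = np.stack(np.meshgrid(rows, cols, indexing="ij"))
    Z_out = map_coordinates(Z, coords, order=1, mode="nearest")
    return Z_out, rate_x, rate_y
```

The two sampling rates are the values that get concatenated with the vectorized depth values to form the label vector.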

Fig. 4 Preprocessed 3D faces from our database and the average face (last).

The shape of human faces resides in a very low dimensional subspace. We estimate this subspace from our training data (80 individuals) augmented with the 3D faces of another 350 individuals from the FRGC v2.0 database [30]. The training 3D faces were normalized and put into a matrix A of column vectors, where each column represents a face. Since the number of faces is much smaller than their dimensionality, we calculate the 3D subspace using Eqs. (11) to (13). We take the 13 dimensions corresponding to the largest eigenvalues (preserving 90% of the energy) and project the training 3D faces to this subspace using Eq. (14). Thus the 3D faces can be reconstructed up to 90% fidelity. Finally, the training faces are normalized so that the variation along each dimension is between −1 and +1.

We assume that the 13 dimensions are linearly independent and train a separate Support Vector Machine (SVM) for each dimension using the input patterns x. Support Vector Regression (SVR) with an RBF kernel is used to learn a non-linear function that estimates one of the 13 coefficients of the 3D face from the input patterns. During testing, novel images of database faces and images of unseen faces are fed to the trained SVRs to estimate the 13 coefficients, which are then projected back to the 3D face space to get the vector of depth values and sampling rates. This vector is used to reconstruct the 3D face to the correct scale without GBR ambiguity.
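A short sketch of this regression stage, using scikit-learn's SVR as a stand-in for the SVM-light implementation [10] used in the paper; hyperparameters and function names are assumptions.

```python
import numpy as np
from sklearn.svm import SVR

def train_shape_svrs(X, Y, **svr_params):
    """Train one RBF-kernel SVR per 3D-subspace coefficient.

    X : (n_train, 45) input patterns from Section 5.
    Y : (n_train, 13) normalized 3D-face subspace coefficients (labels).
    """
    return [SVR(kernel="rbf", **svr_params).fit(X, Y[:, d])
            for d in range(Y.shape[1])]

def predict_shape_coefficients(svrs, x):
    """Estimate the 13 coefficients of a face from one input pattern x."""
    x = np.atleast_2d(x)
    return np.array([m.predict(x)[0] for m in svrs])
```

The predicted coefficients are un-normalized and projected back through the 13-dimensional 3D face basis to recover the vector of depth values and sampling rates, as described above.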

6. Illumination invariant face recognition results

Three experiments were performed using our database, the extended Yale B and CMU-PIE databases. The number of images, subjects and illumination directions for the databases were {4347, 106, 23}, {1710, 38, 45}, {1344, 68, 21} respectively.

6.1. Experiment 1

We use five images per person for training and the rest for testing from our database to find the required number of subspace Contourlet coefficients. Figure 5 (left) shows that the recognition rate peaks with only 340 coefficients. Figure 5 (right) shows the effect of light direction on accuracy. The system is trained and tested with a single image (per person, from different sessions) corresponding to the same illumination. Images with frontal illumination (6 and 16) yield the highest accuracy. Unlike the elevation angle, the azimuth angle of incident light (images 9 to 23) has a non-linear relationship with accuracy. A possible explanation is the non-linear nature of cast shadows, which are more obvious under lateral illumination.

Fig. 5 Exp-1: Left: Recognition rate vs. the number of subspace Contourlet coefficients. Right: Recognition rates for individual images/illumination conditions (x-axis).

6.2. Experiment 2

This experiment was performed using the first session of our database, the extended Yale B and the CMU-PIE databases. One or more images per person are used for training and the rest for testing. We avoid training with all image combinations and take one or two frontally lit with four to five laterally lit images as in [15]. Figure 6 and Table 1 show results on our database. Using 8 training images, the recognition and verification rates at 0.001 FAR were 99.87%. Training with images {5,14,17,2,20,12,21,23} gave the best performance. For fewer training images, we removed images one by one in reverse order, i.e. leaving images lit from smaller angles to the last. This is practically sensible because it requires less space or smaller LCDs, or the subject can be positioned farther from the LCD. Figure 7 and Table 2 show our results on the extended Yale B database. Figure 8 shows error rates for direct comparison with [15]. Our algorithm achieved 100% recognition and verification rates at 0.001 FAR. On the CMU-PIE database [26], with just five training images, we achieved 100% recognition and verification rates at 0.01% FAR.

Fig. 6 Exp-2 results for our database. CMC (left) and ROC (right) curves for different number of training images. Recognition was performed on the basis of a single image.

Fig. 7 Exp-2 results for the extended Yale B database. CMC (left) and ROC (right) curves for different number of training images.

Fig. 8 Exp-2: Error rates on the extended Yale B database for direct comparison with [15].

Table 1. Experiment 2 Results (in %) Using our Database

Table 2. Experiment 2 Results (in %) Using the Extended Yale B Database

6.3. Experiment 3

This setup is similar to Exp-2 except that the training and test images are taken from sessions on different days, which makes it more challenging. Only our database can be used here because in the Yale B and CMU-PIE databases each subject was imaged in a single session only. The gallery size is 106 and the number of test images is 23 × 83 = 1909. Figure 9 and Table 3 show our results. We also test a different setup where the query contains multiple images per person. Training and test images are taken randomly from the first and second sessions respectively and the results are averaged. Figure 10 shows that a significant improvement is achieved with just two images and a maximum identification rate of 98.8% is achieved with five images.

Fig. 9 Exp-3 results for our database.

Fig. 10 Average recognition/verification (at 0.001 FAR) rates versus the number of images in a many to many matching approach. Standard deviation is shown as vertical lines.

Table 3. Experiment 3 Results (in %) Using our Database

6.4. Timing and comparative analysis

With a Matlab implementation on a 2.4 GHz machine with 4 GB RAM, the training time (including the Contourlet transform) using our database of 106 subjects and 6 images per subject was 2 minutes. The recognition time was 258 msec. The average time for the Contourlet transform of a face at 3 scales and 15 orientations was 100 msec and for matching two faces was 0.4 msec. Table 4 compares our results to others including [31, 32]. The error rates for subsets 1 and 2 are zero for all methods. Our algorithm achieves the best performance with zero error on all subsets using the extended (larger) Yale B database and only 8 training images. We also achieved 100% accuracy on the CMU-PIE database using five training images compared to the 99.6% accuracy of Contourlet based filtering [18] using nine training images.

Table 4. Comparison on the Yale B (10 Subjects) and Extended Yale B Databases

7. 3D face reconstruction results

Quantitative results are presented only for our database where the ground truth 3D faces are available. The first session of the first 80 subjects (those appearing in two sessions) was used for training. The second session of these 80 training subjects and one session of the remaining 26 subjects were used for testing. The algorithm was implemented in Matlab and Joachims' C implementation [10] was used for SVR. The average 3D face reconstruction time was 60 msec.

7.1. Experiment 4

Experiment 4 measures the 3D face reconstruction accuracy of database/seen individuals from their novel images (different session). Three images from the first session of 80 individuals were used to train the SVR and three images from the second session of the same individuals were used for testing. This scenario is significant because, in face recognition, a query subject must already be enrolled in the database. Figures 11(a) and 11(b) show our qualitative results. The first row contains the original and the second contains the reconstructed 3D faces. The original 3D faces are texture mapped with images from the first session (i.e. training data) whereas the reconstructed 3D faces are texture mapped with images from the second session (test data).

Fig. 11 Ground truth (top) and reconstructed (bottom) 3D faces from (a)(b) Exp-4 and (c)(d) Exp-5. (e) Reconstruction error of database (Exp-4) and unseen faces (Exp-5).

7.2. Experiment 5

The algorithm was trained on images from the first session of 80 individuals and tested on an unseen set of 26 individuals. Figures 11(c) and 11(d) show qualitative results. Since these individuals appeared in only one session, the ground truth and reconstructed 3D faces are both mapped with the same texture from test data. The texture from the laser scanner was not used in any experiment. In Fig. 11, one can appreciate the similarity between the reference and reconstructed 3D faces of the same individual as well as the dissimilarities between different individuals.

A quantitative comparison of Exp-4 and Exp-5 is given in Fig. 11(e), which plots the percentage of faces against the maximum error in PCA space. A 100% error corresponds to the maximum possible difference between any two faces in the training data. For Exp-4 (previously seen faces), 90% of the reconstructed faces are within 12% error and almost all faces are within a maximum of 20% error. For Exp-5 (unseen faces), 90% of the faces are within 27% error and 80% are within a maximum of 20% error. Although the errors increase for unseen faces, the results are quite promising given the small training data. Table 5 gives the percentage errors along each PCA dimension. To calculate the errors between the reconstructed and the reference 3D faces (at 90% fidelity) in Euclidean space, both were converted to pointclouds and the Euclidean distances between their nearest points were measured (without refining the registration, for fairness). Figure 12 shows our results.
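The nearest-point distance computation lends itself to a k-d tree; a minimal SciPy sketch follows, mirroring the description above (illustrative names, not the authors' code).

```python
import numpy as np
from scipy.spatial import cKDTree

def pointcloud_errors(reconstructed, reference):
    """Euclidean distance from each reconstructed point to its nearest
    reference point, with no registration refinement. Both inputs are (N, 3)."""
    tree = cKDTree(reference)
    dists, _ = tree.query(reconstructed)    # nearest-neighbour distances
    return dists                            # histogram these as in Fig. 12
```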

Fig. 12 Histogram of reconstruction errors in the Euclidean space. Left: Exp-4 (database faces). Right: Exp-5 (unseen faces).

Table 5. Avg. Error (%) for Each Coefficient

7.3. Experiment 6

We analyze the effects when training and test images are taken under different arbitrary illuminations, e.g. the system is trained on frames {2,4,6} (vertical scan) and tested on frames {14,16,18} (horizontal scan). Since the number of possible combinations, $\binom{23}{3}$, is very large, we restrict frame selection to nearly frontal illuminations, i.e. the ten frames {2,3,4,5,6,14,15,16,17,18}. We avoid selecting consecutive frames (nearly similar illuminations) for either training or testing. With these restrictions, there are 23 possible combinations of 3 frames. We randomly select 20 mutually exclusive combinations for training and testing such that all 23 possibilities are used. The average error and standard deviation along each dimension are given in Table 6. Note that there is negligible difference between the accuracies of Exp-4 and Exp-6. This supports our argument that the LCD screen does not need calibration and the position of the subject need not be strictly controlled in our approach. Figure 13 shows 3D faces reconstructed from images of the Yale B database. In the absence of ground truth, only qualitative results are presented.

Fig. 13 Top: Ground truth images of the first five subjects in the Yale B database (pose 7). Middle: 3D reconstructed faces rotated so that their pose is approximately equal to the top row for comparison. Last: The 3D faces are further rotated to show their profile.

Table 6. Exp-6: Average Error and Standard Deviation (%) for Each Coefficient Using 20 Combinations of Training and Test Images

8. Conclusion

We presented a novel algorithm that exploits desktop optics for illumination invariant face recognition. By projecting different scale and orientation features to separate linear subspaces before combining them, we preserve the most discriminating features rather than the most varying ones; note that the most varying features are not necessarily the most unique. We used extended light sources to model illumination variations and achieved promising results. Our algorithm outperforms the state of the art on the extended Yale B and CMU-PIE databases. We also presented an efficient model-based approach for 3D face reconstruction from three images under arbitrary illumination. The 3D faces were reconstructed to the correct scale without GBR ambiguity. We presented quantitative results by comparison with ground truth 3D faces acquired with a laser scanner, something that is missing in the existing literature. Finally, we contribute a novel database comprising multiple facial images under varying illumination and their corresponding 3D face models acquired with a laser scanner. This database is unique and will be made publicly available for evaluating face recognition and 3D face reconstruction algorithms.

Acknowledgments

Thanks to M. Do for the Contourlet Toolbox, the FRGC organizers [30] for data, R. Owens for useful suggestions and the anonymous participants in our experiments. This research was supported by ARC grants DP0881813 and DP110102399.

References and links

1. W. Zhao, R. Chellappa, P. J. Phillips, and A. Rosenfeld, “Face recognition: A literature survey,” ACM Comput. Surv. 35(4), 399–458 (2003). [CrossRef]  

2. M. Turk and A. Pentland, “Eigenfaces for recognition,” J. Cogn. Neurosci. 3, 71–86 (1991). [CrossRef]  

3. P. Belhumeur, J. Hespanha, and D. Kriegman, “Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection,” IEEE Trans. Pattern Anal. Mach. Intell. 19, 711–720 (1997). [CrossRef]  

4. L. Wiskott, J. Fellous, N. Kruger, and C. Malsgurg, “Face recognition by elastic bunch graph matching,” IEEE Trans. Pattern Anal. Mach. Intell. 19(7), 775–779 (1997). [CrossRef]  

5. O. Arandjelovic and R. Cipolla, “Face recognition from video using the generic shape-illumination manifold,” in Proceedings of European Conference on Computer Vision (Springer, 2006), pp. 27–40.

6. K. Lee and D. Kriegman, “Online probabilistic appearance manifolds for video-based recognition and tracking,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2005), pp. 852–859.

7. L. Liu, Y. Wang, and T. Tan, “Online appearance model learning for video-based face recognition,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2007), pp. 1–7.

8. J. Tangelder and B. Schouten, “Learning a sparse representation from multiple still images for on-line face recognition in an unconstrained environment,” in Proceedings of International Conference on Pattern Recognition (IEEE, 2006), pp. 1087–1090.

9. M. Do and M. Vetterli, “The Contourlet transform: an efficient directional multiresolution image representation,” IEEE Trans. Image Process. 14(12), 2091–2106 (2005). [CrossRef]   [PubMed]  

10. T. Joachims, “Making large-scale SVM learning practical,” in Advances in Kernel Methods, (MIT-Press, 1999), pp. 169–184.

11. P. Belhumeur and D. Kriegman, “What is the set of images of an object under all possible illumination conditions?,” Int. J. Comput. Vision 28(3), 245–260 (1998). [CrossRef]  

12. A. Georghiades, P. Belhumeur, and D. Kriegman, “From few to many: Illumination cone models for face recognition under variable lighting and pose,” IEEE Trans. Pattern Anal. Mach. Intell. 6(23), 643–660 (2001). [CrossRef]  

13. P. Hallinan, “A low-dimensional representation of human faces for arbitrary lighting conditions,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 1994), pp. 995–999. [CrossRef]  

14. R. Basri and D. Jacobs, “Lambertian reflectance and linear subspaces,” IEEE Trans. Pattern Anal. Mach. Intell. 25(2), 218–233 (2003). [CrossRef]  

15. K. Lee, J. Ho, and D. Kriegman, “Acquiring linear subspaces for face recognition under variable lighting,” IEEE Trans. Pattern Anal. Mach. Intell. 27(5), 684–698 (2005). [CrossRef]   [PubMed]  

16. Y. Schechner, S. Nayar, and P. Belhumeur, “A theory of multiplexed illumination,” in Proceedings of IEEE International Conference on Computer Vision (IEEE, 2003), pp. 808–815. [CrossRef]  

17. W. R. Boukabou and A. Bouridane, “Contourlet-based feature extraction with PCA for face recognition,” in Proceedings of NASA/ESA Conference on Adaptive Hardware and Systems (IEEE, 2008), pp. 482–486. [CrossRef]  

18. Y. Huang, J. Li, G. Duan, J. Lin, D. Hu, and B. Fu, “Face recognition using illumination invariant features in Contourlet domain,” in Proceedings of International Conference on Apperceiving Computing and Intelligence Analysis (IEEE, 2010), pp. 294–297. [CrossRef]  

19. A. Mian, “Face recognition using Contourlet transform and multidirectional illumination from a computer screen,” in Proceedings of Advanced Concepts for Intelligent Vision Systems (Springer, 2010), pp. 332–334. [CrossRef]

20. K. Bowyer, K. Chang, and P. Flynn, “A survey of approaches and challenges in 3D and multi-modal 3D + 2D face recognition,” Comput. Vis. Image Und. 101, 1–15 (2006). [CrossRef]  

21. D. Scharstein, R. Szeliski, and R. Zabih, “A taxonomy and evaluation of dense two-frame stereo correspondence algorithms,” Int. J. Comput. Vision 47, 7–42 (2002). [CrossRef]  

22. V. Blanz and T. Vetter, “Face recognition based on fitting a 3D morphable model,” IEEE Trans. Pattern Anal. Mach. Intell. 25, 1063–1074 (2003). [CrossRef]

23. G. Schindler, “Photometric stereo via computer screen lighting for real-time surface reconstruction,” in Proceedings of International Symposium on 3D Data Processing, Visualization and Transmission (IEEE, 2008).

24. N. Funk and Y. Yang, “Using a raster display for photometric stereo,” in Proceedings of Canadian Conference on Computer and Robot Vision (IEEE, 2007), pp. 201–207.

25. J. Clark, “Photometric stereo using LCD displays,” Image Vis. Comput. 28(4), 704–714 (2010). [CrossRef]  

26. T. Sim, S. Baker, and M. Bsat, “The CMU pose, illumination, and expression database,” IEEE Trans. Pattern Anal. Mach. Intell. 25(12), 1615–1618 (2003). [CrossRef]  

27. P. Viola and M. Jones, “Robust real-time face detection,” Int. J. Comput. Vision 57(2), 137–154 (2004). [CrossRef]  

28. L. Shen and L. Bai, “A review on Gabor Wavelets for face recognition,” Pattern Anal. Appl. 19, 273–292 (2006). [CrossRef]  

29. J. D’Erico, “Surface fitting using Gridfit,” http://www.mathworks.com/matlabcentral/fileexchange/.

30. P. Phillips, P. Flynn, T. Scruggs, K. Bowyer, J. Chang, K. Hoffman, J. Marques, J. Min, and W. Worek, “Overview of the face recognition grand challenge,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2005), pp. 947–954.

31. T. Chen, W. Yin, X. Zhou, D. Comaniciu, and T. Huang, “Total variation models for variable lighting face recognition,” IEEE Trans. Pattern Anal. Mach. Intell. 28(9), 1519–1524 (2006). [CrossRef]   [PubMed]  

32. X. Tan and B. Triggs, “Enhanced local texture feature sets for face recognition under difficult lighting conditions,” in Proceedings of IEEE International Workshop on Analysis and Modeling of Faces and Gestures (IEEE, 2007). [CrossRef]  
