Light invariant photometric stereo

Yuval Braun; Hugo Guterman; Hugo Guterman

doi:10.1364/OE.477180

1. Introduction

Three-dimensional reconstruction by photometric stereo (PS) has many advantages over other techniques. It enables high-resolution recovery, is simple to implement, requires only one camera, and allows real-time reconstruction [1]. However, the classical method proposed by Woodham [2], suffers several limitations that make it impractical in real-world scenarios. It assumes a Lambertian surface, orthographic camera model, highly calibrated lighting system, and ideal conditions without background light. Works published in recent years allow the method to be adapted to more realistic scenarios such as uncalibrated light directions [3], non-Lambertian surfaces [4], a perspective camera model [5], and unsynchronized light sources [6]. Despite these advances, most methods still do not address the PS sensitivity to sensor noise and ambient light.

There are two primary kinds of noise that influence the PS performance. The first and more significant is ambient light. The classical PS approach [2], assumes that a sequence of three or more images of a specific scene is captured from the same viewpoint and under varying illumination. Based on the intensity variation of each pixel, we can estimate the local orientation of the surface that projects onto that pixel. When the scene is also illuminated by ambient light, the assumption that only one light source is illuminating the surface while each frame is taken is incorrect. This false assumption leads to an error in normal vector reconstruction, leading to an inaccurate 3D shape. A traditional way to deal with the background light is to take an additional frame that contains only the background light without the light sources of the system. The extra frame can be subtracted from the other frames to get an image without the background light. This approach, called Flash\No Flash (FNF), fails when the light is not constant, such as with fluorescent lighting.

The second source of error is sensor noise that appears even in dark environment. This noise includes thermal noise, shot noise, and quantization error. Assuming that it is additive and has zero mean, the sensor noise can be reduced by taking a high number of frames and averaging them. This technique requires capturing a large number of frames that grow linearly with the number of sources and, as a result, prolongs the time necessary for capturing an image.

Most photometric approaches assume that each image is taken while only one light source is active. This assumption requires full synchronization between the camera and the light sources. Previous works [6–8], deal with the need for synchronization by separating the sources based on different wavelengths. These methods use three different monochromatic light sources (red, green, and blue) that illuminate the object simultaneously, and the separation is done by the RGB values of a single image. However, this approach has some limitations. First, expanding it to more than three light sources requires an expensive multispectral camera. Second, since the surface's albedo is different for each wavelength, the reconstruction of the surface is more complicated and requires more assumptions about the object. Last, this method is highly sensitive to ambient light, even more than the traditional PS.

In this paper, a new approach for PS based on modulated light sources is presented. Similar to AM radio, knowing the specific frequency of the modulated light source allows us to filter out any other light in the scene. In this way, an image that depends only on a specific light source while ignoring the background light can be created. Moreover, if multiple sources with different frequencies illuminate the scene, the same principle can be used to separate several modulated signals in the frequency domain. Meaning, separating one video sequence into several corresponding images. These images are then used as an input to the PS algorithm.

The main contributions of this work are:

1. Suggest an alternative approach to color-based unsynchronized PS, based on modulated light sources.
2. Explain how LIPS can reduce both ambient light and sensor noise.
3. Show that LIPS outperforms classic and FNF PS, especially in environment with dynamic ambient light.

This paper is organized as follows. Section 2 presents relevant work of unsynchronized PS and PS under ambient light. In Section 3, the mathematical formulation of the proposed approach is provided. Section 4 discusses the two sources of noise in the image and their influence on PS. Section 5 evaluates the proposed method compared to the classical PS and FNF PS in different lighting conditions. The evaluation is done on both synthetic and real-world data. Conclusions and future work are presented in section 6.

2. Related work

PS has been studied for over four decades, starting with the pioneering work by Woodham [2]. Woodham showed that assuming a Lambertian surface and at least three images taken with different and distant light sources, it is possible to recover the surface orientation for each pixel. Although new approaches expand the solution for non-Lambertian surfaces [4], a perspective camera model [5], unknown light directions [3], and near light sources [9], most of them are still highly sensitive to ambient light and image noise. In addition, a high level of synchronization between the camera and the sources is needed for creating several images that correspond with only one light source.

2.1 Unsynchronized PS

The traditional PS is based on separating the light sources by taking each image while only one light is active. Hernandez et al. [8], suggested using light sources with different wavelengths, i.e., colors, to separate the sources. They used red, green, and blue light sources and created three corresponding images from a single RGB frame. Thus, they eliminated the need for synchronization between the sources and achieved 3D video at a rate of 60 frames per second. However, the fact that the surface albedo is different for each wavelength leads to three unknown albedos per pixel, instead of one in the classical PS. For this reason, most RGB-based PS requires initial assumptions about the surface, such as constant albedo [10], or capturing additional information [8]. In later work, Chakrabarti and Sunkavalli [6], reported RGB-based PS that can deal with spatially-varying albedo by assuming local constant albedo surface patches. Still, these methods all suffer from high sensitivity to ambient light.

As an alternative to spectral-based PS, Herrnsdorf et al. [11], suggested using modulated light sources for frequency division multiple access PS. Similar to this work, each light source is modulated sinusoidally at a different frequency. Using the Power Fourier Transform for the time-sequence intensity for each pixel, they identified the contribution of each light source to the measured brightness. They used frequencies in the range of 3 to 13 Hz. The result of such low frequencies is a strong visual flicker effect that can be unpleasant to the human eye. In a later work [12], the authors reported that by using Manchester-encoded binary frequency division multiple access instead of sinusoidal modulation, the visual flicker effect can be reduced.

The first difference between the proposed approach and Herrnsdorf et al. [11], is that a higher frame rate camera was that enabled us to work with higher modulation frequencies. The higher rate was feasibility because of the different way we controlled the lighting. The system presented by Herrnsdorf et al. [11], used a pseudo-random bit sequence with sinusoidal duty cycle modulation, i.e., the probability of transmitting a’ 1’ was varied sinusoidally. That is, the LED intensity was varied according to on-off modulation and the sine wave received by the camera's long exposure time. In comparison, we used LED drivers that allow us to change the intensity of the LEDs according to analog values and, as a result, not be limited to long exposure time and low frame rate. Thus, higher frequencies could be employed with a flicker effect invisible to the human eye. Second, in our approach, the Fourier Transform is performed only once for the whole frames sequence and not per pixel. i.e., the computational complexity (flop/frame) of the proposed method is $O(N )$, where N is the number of frames, compared to $O({NlogN} )$ in [11]. Last, the authors of [11,12], focused on the lack of synchronization between the camera and the light sources, enabling simple installation in public areas. Inspired by Light Invariant Video Imaging [13], we focus on the ability of the system to filter out ambient lights and perform well in an Illuminated environment.66

2.2 PS with ambient light

The classical PS approach assumes that the surface is illuminated by only one known light source in each frame. When the image is taken outside the laboratory in a real-world scenario, ambient light can fail this assumption. Previous works [14–16], investigated the area of PS in a naturally lit environment. However, most of these methods use the sun [15], or general ambient light [16], as an alternative to controlled light sources. Although some of these approaches require only a camera and a reference object [16], and no additional light source, their accuracy is limited compared to PS with separated known light sources. PS with ambient light can also be performed by adding the ambient illumination term to the classic pixelwise PS equation. In this manner, the linear system of equations contains four unknowns, the scaled vector components, and the ambient light term. This strategy is common in PS approaches [17], but it makes the solution more complicated and requires at least four sources instead of three in the classic PS.

The presented work is closely related to [18], which reported calibrated PS in ambient light by strictly obtaining the depth map using the ratios of image differences. In contrast, we filter out the background light to perform well-defined calibrated PS in an alight environment. Thus, the proposed method can easily improve any modern PS approach that uses separate images as input and neglects the ambient light. To the best of our knowledge, the only work that focuses on separating the controlled lights and the ambient light is [19], where Gu et al. used light sources with high-frequency sinusoidal patterns shifted over time. Contrary to [19], the proposed approach uses non-synchronized light without spatial patterns.

2.3 Light invariant video imaging

The principle of using modulated light sources to filter out the ambient light was first introduced by Kolaman et al. in light invariant video imaging (LIVI) [13]. The authors reported a system that enables the production of images that are not affected by background lighting. By using one modulated light source and a high frame rate camera, this method produces a video independent of background lighting. According to [20], for a single source, AM light intensities surpassing 20 percent of background light level gave appreciable results. Later works showed that using LIVI can improve the performance of computer vision algorithms, such as underwater color reconstruction [20], and convolutional neural networks [21]. Here, the principle of LIVI is extended to more than one modulated light to enable separation between the sources and not only filter out the background light.

3. Separation of modulated lights

Let us assume that M different modulated light sources illuminate the scene ${L_i},i \in \{{1,2,\ldots ,M} \}$ plus an unknown background source. Every modulated light source ${L_i}$ can be formulated as:

(1)$${L_i} = {c_i} + {a_i}\cos ({2\pi {f_i}t} ),$$

where t represents time, ${c_i}$ is the constant intensity over time, and ${a_i}$ is the amplitude of the modulated signal with frequency ${f_i}$. According to [13], the total light in the scene is reflected by the object patch and generates a radiance that satisfies the following:

(2)$$R(t) = C + {R_b}(t) + \sum\limits_{i = 1}^M {{A_i}\cos ({2\pi {f_i}t} )},$$

where C depends on the patch reflectance and the constant parts of all the light sources (background and modulated), ${A_i}$ depends on the patch reflectance and the amplitude ${a_i}$ from Eq. (1), and ${R_b}(t )$ depends on the dynamic background lights. The radiance $R(t )$ is sampled times by an image pixel at discrete time, as:

(3)$$\begin{array}{l} r[n] = C + {R_b}({n{T_s}} )+ \sum\limits_{i = 1}^M {{A_i}\cos ({2\pi {f_i}n{T_s} + {\phi_i}} )} ,\\ n \in \{ 1,2,\ldots ,N - 1\}, \end{array}$$

where ${T_s}$ is the sample time of the camera (frames per second (FPS)) and ${\phi _i},i \in \{{1,2,\ldots ,M} \}$ are the unknown phase differences between the modulated lights and the camera. Since the frequencies ${f_i}$ and the sample time ${T_s}$ are known, all the amplitudes ${A_i}$ can be reconstructed by using a finite impulse response (FIR) filter:

(4)$$\widehat {{A_i}} = \left|{\frac{2}{N}\sum\limits_{n = 0}^{N - 1} {r[n]{e^{ - j2\pi {f_i}{T_s}n}}} } \right|.$$

By applying this filter on every pixel in the scene, an image that contains only the desired modulated light source is obtained. The purpose of the absolute value is to eliminate the unknown phase term ${e^{j{\phi _i}}}$. Repeating this process $M$ times, once for every modulated light source, produces $M$ corresponding images (${I_i},i \in \{{1,2,\ldots ,M} \}$) that are used as an input to the photometric stereo. This procedure is presented in Fig. 1.

Fig. 1. An example of the proposed model architecture. N frames captured while all the three modulated sources are active. Three images corresponding to M = 3 different sources are reconstructed by filtering the specific frequencies from the N capturing frames. The images are used as an input to the Photometric Stereo algorithm to obtain the normal map.

Download Full Size | PDF

Every normal map produced by the algorithm requires $N$ input frames. Thus, assuming the camera frame rate is ${f_s} = 1/{T_s}$, the output frame rate is ${f_{3D}} = {f_s}/N$. Therefore, for applications that require 3D video, we will prefer the minimal N that enables appreciable results. In addition, since the computation and memory complexity of the algorithm is $O(N )$, a large N might not be suitable for real-time applications. For a given ${f_s}$, the maximum 3D imaging frame rate ${f_{3D}}$ is depends on the choice of modulated frequencies. The frequencies will be orthogonal if they are all integer multiples off the output frame rate ${f_s}/N.$ Therefore, assuming $M$ modulated frequencies are equally spaced in a frequency band between ${f_{min}}$ and ${f_{max}}$, i.e., with intervals of $({{f_{\textrm{max}}} - {f_{\textrm{min}}}} )/({M - 1} )$ Hz, the maximum achievable 3D rate is:

(5)$${f_{3D}} = \frac{{{f_s}}}{N} = \frac{{{f_{\max }} - {f_{\min }}}}{{M - 1}}.$$

However, this analysis ignores the ambient light, which has an unknown spectrum. It also does not take into account sensor noise, the inaccuracy of the frequencies generation and the sampling rate in a real-world implementation. Hence, a higher $N$, that corresponds to a lower frame rate ${f_{3D}}$, will produce better results due to better filtering out sensor noise and unwanted harmonics caused by the above described issues.

4. Ambient light and sensor noise

Ambient light can be divided into two types: constant and dynamic. While constant ambient light can be easily removed by the FNF method, dealing with dynamic background light is more challenging. Dynamic ambient light can originate from various sources such as intermittent blocking of the light source, flickering light sources, and even sea waves in underwater imaging. Flickering light sources, such as fluorescent or tungsten lights, oscillate at 100/120 Hz, depending on the frequency of the electric grid. To separate the ambient light from the system's controlled lights, frequencies that differ from 100 Hz, which is twice the frequency of the electric grid in Israel, were chosen for the modulated sources. In the case of a square wave with a frequency f, amplitude $1/2$ and an offset of $1/2$, the signal can be represented as a Fourier series:

(6)$$s(t) = \frac{1}{2} + \frac{2}{\pi }\sum\limits_{k = 1}^\infty {\frac{{\sin ({2\pi (2k - 1)ft} )}}{{2k - 1}}},$$

i.e., the square wave includes harmonics with frequencies of $({2k + 1} )f$ when $k$ is a natural number. If one or more of these harmonics overlap with the modulated lights frequencies, the reconstruction will include a term related to ambient light. Thus, an ideal selection of the modulated light frequencies is a set that does not overlap with the harmonics.

Even in an ideal environment, without any background light, there is still noise associated with the sensor. The noise comes from various sources and depends on the electronic components of the camera, the exposure time, the dynamic range, and more. Sensor noise can be categorized as thermal noise, shot noise, and quantization noise [22,23]. Thermal noise, or Johnson noise, depends on the sensor's temperature and follows Gaussian statistics. Shot noise is the statistical noise associated with the discrete nature of photons at the pixel. Since photon measurement obeys Poisson statistics, shot noise can be modeled as a Poisson process. Quantization noise is related to the error generated by rounding the real signal to n-bit representation. It is a uniform distributed noise that depends on the signal value and the pixel bit depth (e.g., 8-bit or 16-bit). In practice, it is common to model the whole sensor noise as an additive zero-mean Gaussian noise with some variance ${\sigma ^2}$. The noise can be reduced by taking the mean of several frames. However, this requires a large number of frames whose numbers grow linearly along with the number of sources, and as a result, lengthen the time required for capturing images. The influence of sensor noise on PS accuracy was first described in an article by Ray et al. [24], which identifies possible error sources in a classic PS. Later [25], Drbohlav and Chantler showed that for standard PS setup with M sources and sensor noise of zero mean and variance of ${\sigma ^2}$, the minimum uncertainty of the scaled normal vector is $9{\sigma ^2}/M$.

5. Experiments

5.1 Synthetic data

To evaluate the reconstructed surface accuracy quantitatively, we rendered synthetic data with the physically based renderer Mitsuba [26]. Two publicly available 3D objects, Stanford BUNNY, and CALIGULA [27], were rendered under orthographic projection. Camera resolution was set to 512 × 512 and frame rate to 400 frames per second. Similar to [28], zero-mean Gaussian noise with standard deviation $\sigma = 0.8\%$ of the full grayscale range was added to each frame independently. The frames were converted to 8-bit grayscale representation to resemble the sensor in the real-world experiment. The experimental setup consisted of eight directional light sources placed in a circle around the camera spaced equally in tilt by $360/M = 45$ degrees and directed at the axis origin. The irradiance of the ${i^{th}}$ source was set to vary according to:

(7)$${L_i} = \frac{1}{4} + \frac{1}{4}\cos ({2\pi {f_i}t + {\phi_i}} ),$$

where ${f_i}$ is the source frequency, $t$ is the time in seconds, and ${\phi _i}$ is a random phase with uniform distribution in the interval $[{0,2\pi } )$. To meet the Nyquist critera, the modulated sources frequencies must be lower than half the frame rate, e.g., 200 Hz. In addition, for most viewers, the human eye is sensitive to flickering light with a frequency lower than 60 Hz [29]. Thus, the frequencies of the sources were chosen in the range of 76 to 185 Hz, with equal intervals between the frequencies. To simulate static ambient light, the objects were placed on two environment maps: Pisa and Glacier [30]. The environment map intensity was varied as a square wave with different frequencies for dynamic ambient light. We assumed there to be no inter-reflections. A sequence of 400 frames was rendered and processed by eight different FIR filters to eight corresponding images. The MATLAB toolbox PSbox [31], was used to perform classic PS with these eight images. In addition, the scenes’ ground truth normal maps were obtained. The angular error e of each pixel $({x,y} )$ was computed using the estimated and the ground truth normals maps and the following formula:

(8)$$e(x,y) = {\cos ^{ - 1}}({{n_e}(x,y) \cdot {n_{gt}}(x,y)} ),$$

where ${n_e}$ and ${n_{gt}}$ are the estimated and ground-truth normal maps, respectively.

For comparison, we also show the results of classic PS performed in the same conditions with constant sources separated in the time domain instead of modulated sources. Regarding the scenes with ambient light, an extra image was taken with only the ambient light. The additional image was subtracted from the other images to perform the FNF method. As mentioned in section 4, the average frame of a sequence of several frames can reduce the sensor noise. Table 1 presents the relation between the number of frames averaged for every image and the angular error of classic PS taken in a dark environment with sensor noise only. The results show that the more frames are taken, the lower the mean and median angular error. For a fair comparison with our approach, from this point forward, every image in the classical PS experiment was achieved by averaging $400/M = 50$ frames and $400/({M + 1} )= 45$ frames for FNF PS, when the $M + 1$ is related to $M$ sources and extra image for no flash.

Table 1. Mean and median angular error as a function of frames taken per image

View Table | View all tables in this article

Figure 2 shows normal map recovery results along with ground truth and error map in a dark environment and under the lighting condition PISA. In a dark environment all three methods perform similarly, with a slight advantage to the classic PS due to greater sampling (50 frames per image for classic PS compared to 45 for the FNF). It also shows that in constant ambient light, the mean error increases to 26.47° and the median to 25.87° for the classic PS without FNF. In contrast, the constant ambient light barely influences LIPS and the FNF.

Fig. 2. Normal map reconstruction and angular error of Bunny in a dark environment and under constant environment Pisa.

Download Full Size | PDF

Figure 3 presents the mean angular error as a function of the flickering ambient light frequency for Caligula under the Glacier environment map. For FNF, the minimum error is obtained at 80 Hz. The reason is that for 45 frames at 400 FPS, the average process is done over exactly nine periods of the 80 Hz square wave. At other frequencies, when the interval contains a non-integer number of the ambient light periods, the average no flash image does not correctly represent the ambient light signal's average. In general, the mean error of the FNF approach increases at the lower frequencies. Since the sampling time is constant, the averaging process is done over fewer periods when the ambient light frequency is low. Thus, the relative error of the average value is larger, and the FNF method fails to remove the ambient light. Consequently, the FNF method needs to be performed with a higher number of frames and a longer capturing time to deal with low-frequency ambient light.

Fig. 3. The mean angular error as a function of the ambient light frequency. The results are related to Caligula under the Glacier environment map. The lower frequency presented is 1Hz.

Download Full Size | PDF

In comparison, the angular error in our approach barely depends on the ambient light frequency and stays around 3.17°. However, at 1 Hz, the error rises to 3.58°. Since the square wave can be represented as an infinite sum of sinusoidal waves, and some of these harmonics can be close to the modulated sources frequencies, the reconstructed signal includes a term related to the ambient light. This phenomenon is shown in Fig. 4. As a result, the obtained normal map suffers from an error associated with this term, and the error here is higher than the case of other frequencies. Figure 5 shows normal map recovery results of Bunny and Caligula under the lighting condition Pisa and Glacier, respectively. The environment maps were varied as 10Hz square waves.

Fig. 4. Single-sided amplitude spectrum of the average pixel signal $y(t )$. The eight peaks on the right originate from the modulated sources frequencies. The other peaks are the harmonics related to the 1 Hz square wave.

Download Full Size | PDF

Fig. 5. Normal map reconstruction and angular error of BUNNY and CALIGULA under dynamic ambient light. The environment maps intensity varied according to 10 Hz square waves.

Download Full Size | PDF

The results show that LIPS performs better than FNF in periodic ambient light. However, real-world scenarios might include dynamic and non-periodic ambient light caused by unexpected changes in environmental illumination, additional light sources appearing in the scene, or moving light sources. For non-periodic ambient light, the environmental map was varied according to pseudo-random binary series. The ambient light intensity was changed in intervals of 0.1 seconds, corresponding to a PN sequence generated by a linear-feedback shift register. In addition, the simulations performed again with the previous setup while instead of the flickering environmental map, the dynamic ambient light was achieved by an additional moving source. The additional source moved at constant velocity in a semicircular orbit around the camera. Table 2 summarize the results for all the synthetic data experiments. The FNF failed to remove the dynamic ambient light in both pseudo random ambient light and moving object while our method succeeded. As a result, the mean and median angular error of the normal map obtained by the FNF PS increased while it barely changed for LIPS.

Table 2. Results of synthetic data experiments

View Table | View all tables in this article

5.2 Real-world data

A photometric stereo scanner prototype system was built using off-the-shelf products. We used a camera with a Sony IMX 174 sensor manufactured by Matrix-Vision Gmbh (mvBlueFOX3-2). The camera frame rate was set to 398 frames per second, and the camera resolution was 640 × 480. The camera captured raw data and saved the frames in motion jpeg2000 format that used jpeg compression on each frame separately - with no inter-frame coding. The separate frame compression was necessary because of the fast-changing light intensities across the frames.

Three 10,000lm LEDs were used as light sources. The modulated signal was generated by a digital system implemented on an Altera DE1 Board with Altera Cyclone 2 FPGA. The output of the FPGA was three PWM waves with duty cycles varying as sine waves. These three PWM waves were the inputs to three DC1666A buck-boost prototype drivers, which controlled the LED currents. The LEDs were placed around the camera attached to a metal frame to create multiple light directions. The modulation frequencies were set to: 90.8, 115.6, and 141.3 Hz. A chrome sphere was used to find the light directions. To allow easy background removal, the observed object was placed on a green screen.

The camera captured $N = 398$ frames. By taking the mean intensity value of each frame, a sequence of N values can be used to analyze the frequencies in the scene. The frequency-domain of this signal is presented in Fig. 6, where high peaks were observed on the exact pre-determined frequencies. Three FIR filters with pre-determined frequencies were used to obtain three corresponding images. These images were used as inputs to the PS algorithm. In addition, a classical PS with the FNF method was also performed for the same setup. The exposure time was set to $300\mu s$ for the FNF PS. Since LIPS assumes that all the LEDs are active simultaneously, the exposure time was reduced to $150\mu s$ to avoid saturation of the pixels. Similar to the synthetic experiment, every image in the classical PS experiment was achieved by averaging $398/({M + 1} )= 100$ frames.

Fig. 6. Single-sided amplitude spectrum of the average pixel signal for the real-world experiment. The three peaks are on the preset frequencies: 90.8, 115.6, and 141.3 Hz.

Download Full Size | PDF

LIPS and FNF were tested under four different lighting conditions: a dark environment, constant ambient light, periodic ambient light, and non-periodic ambient light. Since high-intensity 10000 lm LEDs were used, the natural ambient light barely influences the results. Thus, to simulate strong background lighting, the scene was illuminated directly by an extra flashlight (Mares Eos 15RZ), in addition to the three light sources. The periodic ambient light was achieved by setting the extra flashlight to flicker at 10Hz. The non-periodic pseudo-random ambient light was achieved by another flashlight (UltraFire WF-502B CREE XM-L2 1000LM 5-Modes), with an S.O.S mode, enabling it to emit Morse code for S.O.S. Because of the lack of a precise normal map reference, ground truth was defined as the results of classic PS performed with the same setup in a dark environment and with 1000 frames per image. Although the evaluation here is not relative to real ground truth, it does show well the effect of background lighting and sensor noise on PS performance. Table 3 summarizes the results of all the real-world experiments, with 100 frames per image for the FNF (total of 400) and 398 frame for LIPS.

Table 3. Results of the real-world experiments

View Table | View all tables in this article

In an ideal dark environment, the FNF performs better than LIPS. This will be explained in section 5.3. In constant light, both methods reached a similar angular error. However, in pseudo-random ambient light, the angular error of the FNF increases to a mean of 6.22° and a median of 4.82°. In contrast, LIPS performance was barely affected by pseudo-random ambient light. The FNF error also increased in the periodic ambient light to a mean of 3.12° and median of 3.09°. For LIPS, The angular error also increased in periodic dynamic light, but to a much smaller degree compared to the FNF.

The results of the experiment under constant ambient light and pseudo-random ambient light are presented in Fig. 7. In addition, the 3D surface reconstructed by depth from gradient algorithm [31], is shown for qualitative comparison between our approach and the FNF method. The effect of the ambient light can be seen in the 3D surface reconstruction column, where the FNF PS underperforms in a dynamic light.

Fig. 7. Normal map reconstruction and angular error of the real-world experiment and 3D surface reconstruction. The dynamic ambient light was achieved using a flashlight flickering according to a pseudo-random sequence. The ground truth is defined as the normal map obtained by classic PS in a dark environment with 1000 frames per image.

Download Full Size | PDF

5.3 Discussion

There are a few key differences between the real-world experiment and the simulation. First, we assume an orthographic camera model, although a perspective projection is more suitable for the real-world experiment [5]. Second, the observed object is not perfectly Lambertian. In addition, the prototype is based on three light sources instead of eight in the synthetic data experiments. Therefore, it is more sensitive to noise, as explained in section 4. Moreover, due to the PWM controller limitations, the modulated sources frequencies are not integer numbers. Thus, the reconstruction described in (4) is less precise than the reconstruction in the simulation. Last, the modulation waves generated by the hardware are imperfect, i.e., they include parasitic harmonies, as can be seen in Fig. 6.

The results in Table 3 show that the FNF method performs better in a dark environment than LIPS. In contrast, the results in the synthetic experiment (Fig. 2) showed the opposite. This mismatch between the simulation and the real-world experiment arises from two main reasons. The first is that the evaluation reference in the real-world experiment was obtained using the FNF. Therefore, the results are biased in favor of the FNF. The second reason is the differences between the simulation and the real-world experiment mentioned above.

The distortions caused by the non-orthographic camera model and the not-perfectly Lambertian material influence both LIPS and the FNF. Therefore, it is reasonable to assume that they cannot explain the differences between the simulation and the real-world experiment. In contrast, non-integer frequencies might influence LIPS performance. Another factor that can explain the discrepancies in the results is the number of sources $M$. For FNF PS, every input image was calculated by averaging $398/({M + 1\; } )= 100$ frames. Thus, every image was the mean of 100 frames in the three source setup instead of 45 in the eight source setup. Therefore, for a constant total number of frames (e.g., 400), the FNF's ability to deal with sensor noise increases when the system contains fewer sources. The higher number of frames per image also affects the ability of the FNF to deal with periodic ambient light. When taking more frames, the averaging process is applied over more periods of the ambient light, and therefore the mean is more accurate. In contrast, the proposed method's ability to handle sensor noise and ambient light is independent of the number of sources M.

To bridge the gap between the results, the simulation experiment was performed again with three modulated light sources and the same model [32], as in the real-world experiment. The ambient light conditions were the same as described in the synthetic data section for Bunny (with Pisa environment light). Three different experiments were performed in each lighting condition: FNF PS, LIPS with the same non-integers frequencies as in the real-world experiment, and LIPS integers frequencies (15 experiments in total). Table 4 summarizes the results of all experiments.

Table 4. Results of the three sources synthetic experiment

View Table | View all tables in this article

The results confirm the assumption that non-integers frequencies cause larger errors than integers-frequencies. These results are also consistent with the results in Table 3 by showing that for a three-source setup, in a dark or a constantly lit environment, the FNF error is smaller than LIPS. The reconstruction using (4) is not perfect and has an error term that is related to spectral leakage [33]. Since the three sources setup is more sensitive to noise, the reconstruction error of (4) causes a higher angular error than in the eight sources setup.

For the FNF, the error is related to the sensor noise and the ambient light. Since every image is now a result of averaging 100 frames instead of 45, the sensor noise and periodic ambient light have less influence on the FNF results. This phenomenon offsets the sensitivity of three sources PS, and the angular error stays low. The high number of frames per image also explains why the FNF error increases only slightly in 10 Hz ambient light. However, the FNF fails to deal with 5 Hz flickering ambient light and non-periodic ambient light, when the mean and the median errors increase to 8.36° and 7.60°, respectively. In contrast, the LIPS results are almost constant for all lighting conditions.

To conclude, the simulation results in Table 4 are consistent with the results of the real-world experiment described in Table 3. Although LIPS has a built-in error caused by the imperfect reconstruction in (4), it can be reduced in two ways. The first is by taking more frames for the filter. The second is by using frequencies with a greater distance between them. In addition, the proposed method error is almost unaffected by ambient light, in contrast to the FNF. Although the FNF method has some advantages in a dark environment and under constant or high-frequency ambient light, it still fails to deal with more challenging light conditions. Moreover, the FNF advantage exists only for a setup with a small number of sources.

6. Conclusions

In this paper, a new imaging setup for unsynchronized PS was presented. We demonstrate that the new setup improves the classic PS performance in an illuminated environment by filtering out ambient light and sensor noise. Experiments on both synthetic and real-world data showed that our method outperformed FNF PS in an environment with dynamic ambient light. The lack of synchronization between the camera and the sources allows a simple installation of PS systems. In addition, the high-frequency modulation eliminates the unpleasant flicker effect of traditional PS systems. Although the new approach is based on classic PS for Lambertian objects, the principle of using modulated light sources can benefit any PS approach that assumes no ambient light. Future work will test the influence of the new setup on modern PS approaches such as [4,5,9].

Acknowledgments

We would like to thank Dr. Amir Kolaman, for his pioneering research that was the basis for this work, and for his help and advice during this research.

Disclosures

The authors declare that there are no conflicts of interest related to this article.

Data availability

Data underlying the results presented in this paper are available in Ref . [34].

References

1. T. Malzbender, B. Wilburn, D. Gelb, and B. Ambrisco, “Surface enhancement using real-time photometric stereo and reflectance transformation,” Eurographics Symp. Render. Tech.245 (2006).

2. R. J. Woodham, “Photometric Method For Determining Surface Orientation From Multiple Images,” Opt. Eng. 19(1), 139–144 (1980). [CrossRef]

3. R. Basri, D. Jacobs, and I. Kemelmacher, “Photometric stereo with general, unknown lighting,” Int. J. Comput. Vis. 72(3), 239–257 (2007). [CrossRef]

4. J. Sun, M. Smith, L. Smith, S. Midha, and J. Bamber, “Object surface recovery using a multi-light photometric stereo technique for non-Lambertian surfaces subject to shadows and specularities,” Image Vis. Comput. 25(7), 1050–1057 (2007). [CrossRef]

5. A. Tankus and N. Kiryati, “Photometric stereo under perspective projection,” Proc. IEEE Int. Conf. Comput. Vis. I, 611–616 (2005).

6. A. Chakrabarti and K. Sunkavalli, “Single-image RGB photometric stereo with spatially-varying albedo,” Proc. - 2016 4th Int. Conf. 3D Vision, 3DV 2016258–266 (2016).

7. R. Anderson, B. Stenger, and R. Cipolla, “Color photometric stereo for multicolored surfaces,” Proc. IEEE Int. Conf. Comput. Vis.2182–2189 (2011).

8. C. Hernández, G. Vogiatzis, G. J. Brostow, B. Stenger, and R. Cipolla, “Non-rigid photometric stereo with colored light,” Proc. IEEE Int. Conf. Comput. Vis. pp. 1–8 (2007).

9. R. Mecca, A. Wetzler, A. M. Bruckstein, and R. Kimmel, “Near field photometric stereo with point light sources,” SIAM J. Imaging Sci. 7(4), 2732–2770 (2014). [CrossRef]

10. G. J. Brostow, C. Hernandez, G. Vogiatzis, B. Stenger, and R. Cipolla, “Video normals from colored lights,” IEEE Trans. Pattern Anal. Mach. Intell. 33(10), 2104–2114 (2011). [CrossRef]

11. J. Herrnsdorf, J. Mckendry, M. Stonehouse, L. Broadbent, G. C. Wright, M. D. Dawson, and M. J. Strain, “LED-based Photometric Stereo-Imaging Employing Frequency-Division Multiple Access,” 2018 IEEE Photonics Conf.1–2 (2018).

12. E. Le Francois, J. Herrnsdorf, L. Broadbent, M. D. Dawson, and M. J. Strain, “Top-down illumination photometric stereo imaging using light-emitting diodes and a mobile device,” Opt. Express 29(2), 1502–1515 (2021). [CrossRef]

13. A. Kolaman, M. Lvov, R. Hagege, and H. Guterman, “Amplitude Modulated Video Camera - Light Separation in Dynamic Scenes,” Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. 2016-Decem, 3698–3706 (2016).

14. L. Shen and P. Tan, “Photometric stereo and weather estimation using internet images,” 2009 IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. Work. CVPR Work. 2009 2009 IEEE, 1850–1857 (2009).

15. L. F. Yu, S. K. Yeung, Y. W. Tai, D. Terzopoulos, and T. F. Chan, “Outdoor photometric stereo,” 2013 IEEE Int. Conf. Comput. Photogr. ICCP 2013 (Section 2), (2013).

16. C. H. Hung, T. P. Wu, Y. Matsushita, L. Xu, J. Jia, and C. K. Tang, “Photometric stereo in the wild,” Proc. - 2015 IEEE Winter Conf. Appl. Comput. Vision, WACV 2015302–309 (2015).

17. A. L. Yuille and D. Snow, “Shape and Albedo from Multiple Images using Integrability,” in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (1997), pp. 158–164.

18. F. Logothetis, R. Mecca, Y. Qu au, and R. Cipolla, “Near-field photometric stereo in ambient light,” Br. Mach. Vis. Conf. 2016, BMVC 2016 (2016). [CrossRef]

19. J. Gu, T. Kobayashi, M. Gupta, and S. K. Nayar, “Multiplexed illumination for scene recovery in the presence of global illumination,” Proc. IEEE Int. Conf. Comput. Vis.691–698 (2011).

20. Amir Kolaman, “Underwater Video Color Correction Using Modulated Light,” Ph.D. dissertation, Department of Electrical and Computer Engineering, BGU, Beersheba (2019).

21. A. Kolaman, D. Malowany, R. R. Hagege, and H. Guterman, “Light invariant video imaging for improved performance of convolution neural networks,” IEEE Trans Circuits Syst Video Technol IEEE T CIRC SYST VID 29(6), 1584–1594 (2019). [CrossRef]

22. M. Granados, B. Ajdin, M. Wand, C. Theobalt, H. P. Seidel, and H. P. A. Lensch, “Optimal HDR reconstruction with linear digital cameras,” Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (1), 215–222 (2010).

23. H. J. Trussell and R. Zhang, “The dominance of Poisson noise in color digital cameras,” Proc. - Int. Conf. Image Process. ICIP329–332 (2012).

24. R. Ray, J. Birk, and R. B. Kelley, “Error Analysis of Surface Normals Determined by Radiometry,” IEEE Trans. Pattern Anal. Mach. Intell. PAMI-5(6), 631–645 (1983). [CrossRef]

25. O. Drbohlav and M. Chantler, “On optimal light configurations in photometric stereo,” Proc. IEEE Int. Conf. Comput. Vis. II, 1707–1712 (2005).

26. Wenzel Jakob, “Mitsuba renderer,” https://www.mitsuba-renderer.org.

27. Cosmo Wenman, “The Getty Caligula,” https://cosmowenman.com/2015/02/10/the-getty-caligula-published.

28. J. Ackermann and M. Goesele, “A survey of photometric stereo techniques,” Found. Trends Comput. Graph. Vis. 9(3-4), 149–254 (2015). [CrossRef]

29. X. Wu and G. Zhai, “Temporal psychovisual modulation: A new paradigm of information display,” IEEE Signal Process. Mag. 30(1), 136–141 (2013). [CrossRef]

30. USC Institute for Creative Technologies, “High-Resolution Light Probe Image Gallery,” https://vgl.ict.usc.edu/Data/HighResProbes/.

31. Ying Xiong. “PSBox” (https://www.mathworks.com/matlabcentral/fileexchange/45250-psbox), MATLAB Central File Exchange. Retrieved January 26, 2022.

32. Weeliano, “Face on Table?,” https://www.thingiverse.com/thing:989487.

33. F. J. Harris, “On the Use of Windows for Harmonic Analysis with the Discrete Fourier Transform,” Proc. IEEE 66(1), 51–83 (1978). [CrossRef]

34. Y. Braun, “LIVIPS,” GitHub repository 2022, https://github.com/yuvalbraun/LIVIPS.

Frames per Image	Mean Angular Error	Median Angular Error
$5$	${8.68}^{\circ}$	${7.60}^{\circ}$
$10$	${6.35}^{\circ}$	${5.61}^{\circ}$
$20$	${4.76}^{\circ}$	${4.20}^{\circ}$
45	${3.57}^{\circ}$	${3.11}^{\circ}$
50	${3.46}^{\circ}$	${2.99}^{\circ}$
100	${2.93}^{\circ}$	${2.46}^{\circ}$
400	${2.41}^{\circ}$	${1.90}^{\circ}$

Ambient Light	Object	FNF (mean\median)	LIPS (mean\median)
No ambient light	Bunny	${3.57}^{\circ} ∖ {3.11}^{\circ}$	${3.47}^{\circ} ∖ {3.00}^{\circ}$
No ambient light	Caligula	${3.22}^{\circ} ∖ {2.84}^{\circ}$	${3.15}^{\circ} ∖ {2.76}^{\circ}$
Constant	Bunny	${3.61}^{\circ} ∖ {3.14}^{\circ}$	${3.50}^{\circ} ∖ {3.14}^{\circ}$
Constant	Caligula	${3.34}^{\circ} ∖ {2.94}^{\circ}$	${3.17}^{\circ} ∖ {2.78}^{\circ}$
10 Hz Flicker	Bunny	${7.53}^{\circ} ∖ {6.42}^{\circ}$	${3.48}^{\circ} ∖ {3.01}^{\circ}$
10 Hz Flicker	Caligula	${12.50}^{\circ} ∖ {10.30}^{\circ}$	${3.17}^{\circ} ∖ {2.79}^{\circ}$
Pseudo-Random Sequence	Bunny	${20.65}^{\circ} ∖ {14.71}^{\circ}$	${3.56}^{\circ} ∖ {3.05}^{\circ}$
Pseudo-Random Sequence	Caligula	${24.94}^{\circ} ∖ {14.54}^{\circ}$	${3.79}^{\circ} ∖ {3.30}^{\circ}$
Moving Source	Bunny	${24.88}^{\circ} ∖ {23.82}^{\circ}$	${3.48}^{\circ} ∖ {3.00}^{\circ}$
Moving Source	Caligula	${27.44}^{\circ} ∖ {27.42}^{\circ}$	${3.17}^{\circ} ∖ {2.77}^{\circ}$

Ambient Light	Error Index	FNF	Proposed Method
Dark	Mean	${0.51}^{\circ}$	${1.59}^{\circ}$
Dark	Median	$0^{\circ}$	${1.48}^{\circ}$
Constant	Mean	${1.74}^{\circ}$	${1.95}^{\circ}$
Constant	Median	${1.42}^{\circ}$	${1.89}^{\circ}$
10 Hz Flicker	Mean	${3.12}^{\circ}$	${2.41}^{\circ}$
10 Hz Flicker	Median	${3.09}^{\circ}$	${2.25}^{\circ}$
Pseudo-Random Sequence	Mean	${6.22}^{\circ}$	${1.65}^{\circ}$
Pseudo-Random Sequence	Median	${4.82}^{\circ}$	${1.51}^{\circ}$

Ambient Light	Error Index	FNF	Integers Frequencies	Non-Integers Frequencies
No ambient light	Mean	${3.42}^{\circ}$	${4.48}^{\circ}$	$5.33$
No ambient light	Median	${2.99}^{\circ}$	${4.05}^{\circ}$	$4.85$
Constant	Mean	${3.51}^{\circ}$	${4.50}^{\circ}$	${5.37}^{\circ}$
Constant	Median	${3.07}^{\circ}$	${4.07}^{\circ}$	${4.87}^{\circ}$
10 Hz Flicker	Mean	${3.94}^{\circ}$	${4.48}^{\circ}$	${5.34}^{\circ}$
10 Hz Flicker	Median	${3.53}^{\circ}$	${4.05}^{\circ}$	${4.84}^{\circ}$
5 Hz Flicker	Mean	8.36 $^{\circ}$	4.49 $^{\circ}$	${5.34}^{\circ}$
5 Hz Flicker	Median	7.60 $^{\circ}$	4.06 $^{\circ}$	${4.85}^{\circ}$
Pseudo-Random Sequence	Mean	${19.68}^{\circ}$	${4.50}^{\circ}$	${5.41}^{\circ}$
Pseudo-Random Sequence	Median	${17.28}^{\circ}$	${4.06}^{\circ}$	${4.92}^{\circ}$

Frames per Image	Mean Angular Error	Median Angular Error
$5$	${8.68}^{\circ}$	${7.60}^{\circ}$
$10$	${6.35}^{\circ}$	${5.61}^{\circ}$
$20$	${4.76}^{\circ}$	${4.20}^{\circ}$
45	${3.57}^{\circ}$	${3.11}^{\circ}$
50	${3.46}^{\circ}$	${2.99}^{\circ}$
100	${2.93}^{\circ}$	${2.46}^{\circ}$
400	${2.41}^{\circ}$	${1.90}^{\circ}$

Light invariant photometric stereo

Abstract

1. Introduction

2. Related work

2.1 Unsynchronized PS

2.2 PS with ambient light

2.3 Light invariant video imaging

3. Separation of modulated lights

4. Ambient light and sensor noise

5. Experiments

5.1 Synthetic data

5.2 Real-world data

5.3 Discussion

6. Conclusions

Acknowledgments

Disclosures

Data availability

References

Data availability

Cited By

Figures (7)

Tables (4)

Equations (8)

Optics Express