BiPMAP: a toolbox for predicting perceived motion artifacts on modern displays


Abstract

Viewers of digital displays often experience motion artifacts (e.g., flicker, judder, edge banding, motion blur, color breakup, depth distortion) when presented with dynamic scenes. We developed an interactive software tool for display designers that predicts how a viewer perceives motion artifacts for a variety of stimulus, display, and viewing parameters: the Binocular Perceived Motion Artifact Predictor (BiPMAP). The tool enables the user to specify numerous stimulus, display, and viewing parameters. It implements a model of human spatiotemporal contrast sensitivity in order to determine which artifacts will be seen by a viewer and which will not. The tool visualizes the perceptual effects of discrete space-time sampling by presenting, side by side, the expected percept of a continuous stimulus and the expected percept of the same stimulus presented with the spatial and temporal parameters of a prototype display.

© 2024 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

Displays are an interface that bridges the two ends of an information delivery system: the electronic signal supplied to the display and the viewer receiving the information converted to light by the display. For effective communication and a realistic visual experience, the interface should be tailored to the capabilities of the human visual system. Therefore, display manufacturers have worked to improve the design of their products—desktop monitors, projectors, televisions, cinema, and virtual-reality (VR) and augmented reality (AR) headsets—with the goal of achieving a realistic dynamic experience.

On a display the motion of an object is presented as a sequence of static views. A proper design would ensure that the viewer perceives smooth motion free of artifacts [1,2]. Unfortunately, this goal is often not achieved. Rather, a variety of motion artifacts can occur including flicker, judder (unsmooth motion), edge banding, motion blur, color breakup, and depth distortion [1,3–6]. To predict when artifacts will appear and how to minimize them, one needs to consider the properties of the input image sequence, the spatial and temporal parameters of the display, and the spatio-temporal visual sensitivity of the viewer as he/she is positioned to view the display. Our software tool predicts the above motion artifacts for a set of user-defined parameters in order to assess the performance of modern display designs.

To model the visual processing of the input stimulus we employ the contrast sensitivity function (CSF). The CSF characterizes the visual system’s response to luminance variation in space and time. It plots the reciprocal of the just-visible contrast at various spatiotemporal frequencies. Said another way, the CSF delineates how the visual system’s sensitivity to contrast varies across frequency bands, determining stimulus visibility. The visible domain has been called the window of visibility [1,2]. Motion on a display creates an output spatio-temporal frequency spectrum after filtering by the CSF. As long as that output spectrum is identical to the output spectrum of continuous real-world motion, the displayed object will be free of motion artifacts [1]. Luminance has a significant effect on the CSF, so a prediction tool should also incorporate that parameter. Our model includes the effects of spatial frequency, temporal frequency, and luminance while remaining computationally simple for practical use.

We developed, for the first time to our knowledge, a ’click-and-run’ toolbox to guide the design of modern displays. We call it the Binocular Perceived Motion Artifact Predictor (BiPMAP). It enables prediction and visualization of motion artifacts. In this version of the toolbox, the input stimulus is a bright object moving horizontally at constant speed across a dark background [1,3]. Our main contributions are:

  • 1. A pipeline that visualizes a variety of predicted motion artifacts through analysis in the Fourier domain, making use of the window-of-visibility concept.
  • 2. A simple CSF model incorporating spatial frequency, temporal frequency, and luminance.
  • 3. Modeling smooth eye movements to determine their predicted effect on artifacts.
  • 4. Predicting color breakup in color-sequential displays.
  • 5. Modeling binocular disparity in field-sequential stereo displays, enabling prediction of distortions in perceived depth.
  • 6. An interface that accepts a comprehensive list of user-defined stimulus, display, and viewing parameters.
  • 7. Toolbox as an interactive executable file with useful visualizations.

We note that other efforts along these lines have been presented [1,7–13]. But ours is the first, to our knowledge, to incorporate such a broad range of stimulus, display, and viewing parameters, to include the effect of eye movements, and to include the phenomena of color breakup and depth distortion.

2. Background and related work

2.1 Window of visibility for the perception of a time-sampled moving stimulus

A displayed moving stimulus is a time-sampled version of the corresponding real continuous motion. A useful method for representing and analyzing sampled motion was developed by Watson and colleagues [1]. Here, we review this development, which forms the framework for our toolbox.

For simplicity, consider an infinitely long vertical line with unit contrast moving horizontally at constant speed. The resulting contrast distribution along the vertical axis is uniform, while along the horizontal axis $x$ it is:

$$l(x,t)= \delta(x-rt),$$
where $l(x,t)$ specifies the contrast as a function of position $x$ and time $t$. $r$ is the speed ($\Delta x$/$\Delta t$) and $\delta$ is the Dirac delta function. If the stimulus is presented stroboscopically (i.e., pixels are illuminated for a very short portion of each frame) on the display, the sampled stimulus $l_s(x,t)$ can be represented by the multiplication of the continuous contrast distribution and a sampling function:
$$l_s(x, t) = \Delta t\delta (x-rt) \sum_{n={-}\infty}^{\infty} \delta (t - n\Delta t),$$
where $n$ is the frame number and $\Delta t$ is the sampling period.

If the stimulus is illuminated for the full duration of each frame (sample and hold), the sampled stimulus now has a temporal staircase presentation:

$$l_z(x,t) = \left[ \Delta t\delta (x-rt) \sum_{n={-}\infty}^{\infty} \delta (t - n\Delta t) \right] \ast z(x,t),$$
where $z(x,t)$ is the unit staircase function [$\omega _s\delta (x)\text {rect}(t\omega _s)$], $\ast$ denotes convolution, and $\omega _s = 1/\Delta t$ is the sampling frequency.

Watson and colleagues approached the problem of predicting motion artifacts in the Fourier domain. The frequency spectrum of the continuous stimulus is:

$$L(u,\omega) = \mathcal{F}_{x,t}[l(x,t)] = \delta (ru+\omega),$$
where $u$ is spatial frequency, $\omega$ is temporal frequency, and $\mathcal {F}$ denotes the Fourier transform. The above frequency spectrum is a line of infinite length and has a slope equal to the opposite reciprocal of the motion velocity in the space-time domain.

The stroboscopic sampled stimulus has a spectrum of:

$$L_s (u, \omega) = \mathcal{F}_{x,t}[l_s (x,t)] = \sum_{n ={-}\infty}^{\infty} \delta (ru + \omega - n\omega_s),$$
which contains an infinite set of replicates of the spectrum of the continuous motion with a spacing of $\omega _s$ between intercepts on the temporal frequency axis.

For the staircase representation of a sampled stimulus, the Fourier transform yields the same function as Eqn. (5) multiplied by a temporal sinc function [$\sin(\pi\omega)/(\pi\omega)$] whose effect is to modulate the amplitudes of the replicates:

$$L_z (u, \omega) = \left[ \sum_{n ={-}\infty}^{\infty} \delta (ru + \omega - n\omega_s) \right] \text{sinc}(\frac{\omega}{\omega_s}).$$

To examine how the above frequency spectra are altered by the visual system, we use the CSF to quantify the resolution limits of the visual system in space and time. Watson and colleagues [1] approximated the boundary between visible and invisible frequencies as a rectangle in the spatial and temporal frequency domain, and called this the window of visibility [1]. As we said earlier, this conceptualization is very useful because it proposes that a discrete stimulus on a display will appear the same as a continuous stimulus whenever the frequency spectra after filtering by the CSF are the same.
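
These relationships can be checked numerically. The sketch below (Python/NumPy; a simplified illustration rather than code from the BiPMAP toolbox, with grid sizes, speed, capture rate, and hold interval chosen arbitrarily) builds a discretized space-time image of a sampled moving line and takes its 2D FFT. The spectrum shows the tilted line of Eqn. (4) plus replicates spaced by the sampling frequency along the temporal-frequency axis, as in Eqns. (5) and (6).

```python
import numpy as np

# Discretized space-time image of a line moving at r deg/s, presented with a
# given capture rate and hold interval (~0 for stroboscopic, 1 for sample and hold).
nx, nt = 512, 512
x = np.linspace(0.0, 4.0, nx)                 # position, deg
t = np.linspace(0.0, 0.5, nt)                 # time, s
dx, dt = x[1] - x[0], t[1] - t[0]
r, capture_rate, hold = 5.0, 60.0, 1.0

frame_onsets = np.floor(t * capture_rate) / capture_rate   # quantize time to frame onsets
lit = (t - frame_onsets) < hold / capture_rate             # is the pixel illuminated?
pos = r * frame_onsets                                     # position is held within a frame
stim = np.zeros((nx, nt))
ix = np.clip(np.searchsorted(x, pos), 0, nx - 1)
stim[ix, np.arange(nt)] = lit.astype(float)

# 2D FFT: replicates of the continuous-motion spectrum appear at multiples of the
# capture rate on the temporal-frequency axis (Eqn. 5); when hold > 0 their
# amplitudes are attenuated by the temporal sinc of Eqn. (6).
spectrum = np.fft.fftshift(np.abs(np.fft.fft2(stim)))
u = np.fft.fftshift(np.fft.fftfreq(nx, d=dx))    # spatial frequency, cycles/deg
w = np.fft.fftshift(np.fft.fftfreq(nt, d=dt))    # temporal frequency, Hz
print(stim.shape, float(spectrum.max()))
```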

2.2 Motion artifacts

We consider several motion artifacts:

  • Flicker: Flicker is perceived temporal variation in brightness. In the frequency domain, flicker will be perceived when replicates encroach on the window of visibility near a spatial frequency of zero: i.e., when the replicates intersect the temporal frequency axis inside the window of visibility and have non-zero amplitude at that intersection [3].
  • Judder: Judder is an artifact in which motion appears unsmooth. It occurs when one or more of the sampling-induced spectral replicates fall within the window of visibility at non-zero spatial frequency.
  • Edge banding: Edge banding is an artifact in which more than one instance of a moving edge is perceived. It often occurs along with judder. The likelihood of edge banding is greater when displays employ multi-flash protocols: repeated presentations of each frame [4].
  • Motion blur: Motion blur is an artifact in which a sharp edge is perceived as blurred [2,14]. It occurs when the viewer makes a smooth tracking eye movement to follow the moving stimulus and the display is not stroboscopic. Judder and edge banding are not usually seen when motion blur is apparent.
  • Color breakup: Color breakup, also known as the rainbow effect, is an artifact observed with displays that present colors sequentially [15]: the leading and trailing edges of a moving object with broad wavelength distribution are perceived as distinct color fringes. Although the artifact is more evident with tracking eye movements, it can also be seen when the eye does not move and the object moves past. Color breakup can be minimized or even eliminated by various methods [16–18] including applying spatial offsets to the 2nd and 3rd colors being presented [19].
  • Depth distortion on stereoscopic displays: The above-mentioned motion artifacts occur in stereoscopic and non-stereoscopic displays alike. In addition, stereoscopic displays are frequently prone to another artifact: depth distortion [3,20,21]. Stereoscopic displays often present the images to the two eyes in temporal alternation (e.g., the left-eye’s image is presented at one time and the right-eye’s image at another time in that frame). This temporal offset causes an alteration in the brain’s estimate of the binocular disparity of a moving object, which in turn causes the object to appear at an unintended depth [3,20,22,23].

3. Overall design of BiPMAP

The BiPMAP toolbox has the following design goals. It should allow for a comprehensive list of user inputs, including the velocity of the moving stimulus, and nearly all display and viewing parameters. It should adopt a simple but accurate CSF model. The toolbox should present continuous and sampled stimuli side-by-side, so that motion artifacts caused by the discrete sampling that accompanies displays can be identified, investigated, and possibly eliminated by adjusting design parameters. Finally, it should allow visualization of a variety of motion artifacts in non-stereoscopic and stereoscopic displays.

BiPMAP consists of an interactive user interface and computational pipeline (Figs. 1 & 2). The executable toolbox is available at: https://github.com/CIVO-BiPMAP/executable/releases


Fig. 1. Overview of BiPMAP. (A) Front end with user-defined parameters divided into two components: Setup (configuration inputs) and Parameter Selection (simulation variables). (B) Back-end pipeline for artifact predictions containing two streamlines—Non Stereo (top) and Stereo (bottom)—and a figure-generation step. When "Compare Mode" in the Setup component of (A) is enabled, the Comparison Processor integrates information from past runs and outputs a comparison figure.



Fig. 2. Input parameters in the user interface. The three panels on the left show the interfaces for selecting, respectively, stimulus, display, and viewing parameters. Within each panel, only parameters of the specified group can be defined. The panel on the right shows the interface for stereoscopic displays.


3.1 Inputs and running modes

  • Device selection: BiPMAP automatically detects the computational devices available to the user (e.g., CPU, GPU) and lists them under the ’Device Selection’ button (Fig. 1(A), Fig. 2), allowing the user to select one.
  • Run type: BiPMAP has two running modes: motion-artifact predictions for non-stereoscopic displays (default) and depth distortions for stereoscopic displays. Users can switch between the two modes with a toggle.
  • Compare mode: When users want to compare results from different runs, BiPMAP can run under ’compare mode’ which allows users to choose a master run that will be set as the reference during the comparison. By default—i.e., without selection of a master run—the perceived continuous motion will be the reference.

3.2 Motion-artifact predictions on non-stereoscopic displays

The pipeline consists of four steps to compute and visualize artifacts (Fig. 1(B)):

  • 1. Configure the continuous and sampled motion pipelines with user-defined parameters (Fig. 1(B), panel "Generate Stimulus").
  • 2. Compute the Fourier transforms of the continuous and sampled stimuli (Fig. 1(B), panel "Compute Spectrum").
  • 3. Apply spatiotemporal filtering due to the CSF—i.e., a non-binary window of visibility—to obtain the output spectra (Fig. 1(B), panel "Apply Visual Model"), by multiplying the spectrum computed above by the CSF.
  • 4. Reconstruct the perceived stimuli via inverse Fourier transforms of the CSF-filtered spectra above, with continuous and sampled representations side-by-side so that any motion artifacts can be easily identified (Fig. 1(B), panel "Reconstruction").

The tool outputs and presents results from all four of the aforementioned steps, including both continuous and sampled representations of the stimuli, to enhance user interpretation.
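
These four steps map onto a short script. Below is a minimal sketch of the processing order, assuming a grayscale space-by-time contrast array (such as the one built in the Section 2.1 example) and a user-supplied CSF function; it illustrates the pipeline structure and is not the toolbox's implementation.

```python
import numpy as np

def predict_perception(stim, dx, dt, csf):
    """Steps 2-4 of the pipeline for a space-by-time contrast array `stim`.

    dx, dt : sample spacing in deg and s.
    csf    : callable csf(u, w) giving contrast sensitivity on the frequency grid.
    """
    # Step 2: Fourier transform of the (continuous or sampled) stimulus.
    spectrum = np.fft.fft2(stim)
    u = np.fft.fftfreq(stim.shape[0], d=dx)      # spatial frequency, cpd
    w = np.fft.fftfreq(stim.shape[1], d=dt)      # temporal frequency, Hz
    U, W = np.meshgrid(u, w, indexing="ij")

    # Step 3: apply the non-binary window of visibility by multiplying the
    # spectrum by the normalized CSF.
    gain = csf(np.abs(U), np.abs(W))
    filtered = spectrum * gain / gain.max()

    # Step 4: reconstruct the perceived stimulus with an inverse transform.
    perceived = np.real(np.fft.ifft2(filtered))
    return spectrum, filtered, perceived

# A binary rectangular window of visibility (Section 2.1) as a simple stand-in CSF:
box_csf = lambda u, w: ((u < 30.0) & (w < 60.0)).astype(float)
spec, filt, percept = predict_perception(np.random.rand(128, 128), 0.01, 0.002, box_csf)
```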

3.3 Input parameters & stimulus configuration

There are three sets of input parameters: Stimulus properties, display parameters, and viewing parameters (Fig. 2).

Stimulus parameters (Fig. 2, Stimulus):

  • Velocity: Input is in cm/s and is converted to °/s within the pipeline given the user-defined viewing distance (see the conversion sketch after this list). Direction of motion is horizontal.
  • Stimulus size: Stimulus width in cm along the dimension orthogonal to the motion axis. Converted into number of pixels and degrees using the pixel size and viewing distance defined previously.
  • Recording length: Duration of the sampling in seconds. Longer recording length improves spectral resolution but increases computation time; it is also bounded by the memory of the processing device. Default recording length is 0.5s. Length can be adjusted according to memory capacity and computational power of the GPU/CPU. When the toolbox is in RGB mode and the computational device is a GPU with <$24$GB of dedicated memory, we recommend decreasing the recording length.
  • $L_{max}$: Luminance of the stimulus. Under RGB mode, it is the summed luminance from all three colors, and each color is assigned a luminance of $\frac {1}{3} L_{max}$ (note that luminance is in standardized units of $cd/m^2$).
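
The velocity and size conversions above depend only on the viewing distance and pixel pitch. A small sketch using standard visual-angle formulas (an on-axis approximation; the toolbox's exact internal conversions are not spelled out in the text):

```python
import numpy as np

def cm_per_s_to_deg_per_s(v_cm_per_s, viewing_distance_cm):
    """On-screen speed to angular speed at the eye (on-axis approximation)."""
    return np.degrees(np.arctan2(v_cm_per_s, viewing_distance_cm))

def size_cm_to_deg_and_px(size_cm, viewing_distance_cm, dpi):
    """Stimulus width to visual angle (deg) and number of pixels."""
    size_deg = np.degrees(2 * np.arctan2(size_cm / 2, viewing_distance_cm))
    size_px = size_cm / (2.54 / dpi)             # pixel size = 1/DPI inch = 2.54/DPI cm
    return size_deg, size_px

# 1 cm/s at a 50 cm viewing distance is about 1.15 deg/s, consistent with Fig. 4.
print(round(cm_per_s_to_deg_per_s(1.0, 50.0), 2))
print(size_cm_to_deg_and_px(0.1, 50.0, 300))
```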

Display parameters (Fig. 2, Display):

  • Number of flashes: The number of times each frame is displayed before updating to new image data.
  • RGB mode: Three options are available: black and white (BW), field sequential (RGB-seq), and simultaneous (RGB-simul). If the user is not concerned with color artifacts, black-and-white mode is recommended to minimize processing time and memory consumption.
  • Capture rate: Frames per second (Hz), which is the number of samples along the motion trajectory per second. It differs from presentation rate when the number of flashes is more than 1.
  • Hold interval: The proportion of the frame during which pixels are illuminated; it ranges from 0–1 (0 for stroboscopic; 1 for sample and hold).
  • Pixel response: The time required for a pixel to reach full intensity and to return to the intensity before the frame started. The hold interval above is defined from the start of the rising pixel intensity to the end of the falling intensity. The rise and fall are both modeled as linear for computational simplicity.
  • DPI: Pixels per inch. Pixel size is $1/DPI$.
  • Fill factor: Pixel fill factor is the spatial proportion of the pixel that is illuminated; values from 0–1. Default value is 1, which means there is no gap between adjacent illuminated pixels.
  • Contrast: Weber contrast, defined as:
    $$Contrast = \frac{L_{max} - L_{min}}{L_{min}},$$
    where $L_{min}$ is the luminance of the background. From Eqn. (7), the background luminance is calculated as:
    $$L_{min} = \frac{L_{max}}{1+ Contrast},$$

    Under RGB mode, the contrasts for the three colors are assigned separately, and $L_{max}$ is replaced with $\frac {1}{3}L_{max}$ (note that luminance is in standardized units of $cd/m^2$); a short worked example follows this list.
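
A short worked instance of Eqns. (7) and (8) with illustrative values (not values specified in the paper):

```python
# Background luminance from Weber contrast (Eqns. 7 and 8); values are illustrative.
L_max, contrast = 150.0, 50.0                 # stimulus luminance (cd/m^2), Weber contrast
L_min = L_max / (1 + contrast)
print(round(L_min, 2))                        # ~2.94 cd/m^2

# RGB mode: each channel carries one third of the stimulus luminance and its
# contrast is assigned separately with the same formula.
L_max_rgb = L_max / 3
L_min_rgb = L_max_rgb / (1 + contrast)
```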

Viewing parameters (Fig. 2, Viewing):

  • Viewing distance: Distance from the display to the viewer’s eyes in cm.
  • Object tracking: Checkbox to determine if the viewer’s eye will track the moving stimulus. If checked, $v_{eye} = v_{stimulus}$ so the retinal speed of the stimulus becomes $0$.

3.4 Mathematical representation of user-defined stimuli

Equations (2)–(6) set up the basic framework for our analysis. We next describe how we incorporated several additional parameters. Specifically, we provide details on how we expanded Watson’s development to enable a comprehensive list of user-defined parameters. Readers who wish to skip the mathematical development can go to Section 3.5.

We first consider the situation in which no eye movement occurs. Starting from Eqn. (1), the contrast distribution from a narrow moving object with unit contrast can be expressed as:

$$c_c(x,t) = \delta(x - rt) \ast ker(x),$$
where $c_c$ denotes the contrast distribution that is continuous in time and $ker(x)$ is the kernel that describes the spatial footprint of the object along the $x$ dimension:
$$\begin{aligned} ker(x) & = \frac{1}{X} \text{rect}(\frac{x}{X})\left[ \sum_{k ={-}\infty}^{\infty}\delta (x - kp) \ast \frac{1}{fp} \text{rect}(\frac{x}{fp}) \right]\\ & = \frac{1}{Xfp} \text{rect}(\frac{x}{X})\sum_{k ={-}\infty}^{\infty} \text{rect}(\frac{x-kp}{fp}), \end{aligned}$$
where $X$ is the angular object width in degrees, $k$ is the pixel index, $p$ is angular pixel size in degrees, and $f$ is the pixel fill factor.

Temporal sampling of Eqn. (9) generates a stroboscopic representation of the moving stimulus:

$$c_s(x,t) = \Delta t \delta(x-rt) \ast ker(x) \sum_{n ={-}\infty}^{\infty}\delta (t - n\Delta t).$$

We can use a generalized staircase function to incorporate the hold interval $h$ and pixel response $\tau$:

$$z_g(t) = \frac{1}{(h\Delta t - \tau)\tau} \left[ \text{rect}(\frac{t}{h\Delta t - \tau}) \ast \text{rect}(\frac{t}{\tau}) \right],$$
where the hold interval ($h$) spans both the plateau and the linear rise and fall of the pixel intensity produced by the convolution. Incorporating this into the stimulus representation:
$$c_z(x,t) = c_s(x,t) \ast z_g(t).$$

Applying the Fourier transform to Eqn. (13), the spectrum of the configured stimulus is:

$$\begin{aligned} C_z(u,\omega) & = \mathcal{F}_{x,t} \left[ c_s(x,t) \right] \mathcal{F}_{x,t} \left[ z_g(t) \right]\\ & = C_s(u,\omega) Z_g(\omega)\\ & = \left[ \delta(\omega + ru)K(u) \ast\sum_{n ={-}\infty}^{\infty}\delta(\omega - n\omega _s) \right] Z_g(\omega), \end{aligned}$$
where $K(u)$ is the Fourier transform of $ker(x)$ (Eqn. (10)):
$$K(u) = \text{sinc}(Xu) \ast \left[\text{sinc}(fpu)\sum_{k ={-}\infty}^{\infty}\text{exp}({-}j2\pi kpu) \right],$$

$Z_g(\omega )$ is the Fourier transform of the generalized staircase function (Eqn. (12)):

$$Z_g(\omega) = \text{sinc} \left[ (\frac{h}{\omega_s} - \tau)\omega \right] \text{sinc}(\tau\omega).$$

Plugging Eqns. (15) and (16) into Eqn. (14) gives:

$$\begin{aligned}C_z(u,\omega) & = \sum_{k ={-}\infty}^{\infty} \text{sinc}(Xu) \ast \left[ \text{sinc}(fpu) \text{exp}({-}j2\pi kpu)\right]\\ & \sum_{n ={-}\infty}^{\infty} \delta (\omega + ru - n\omega_s)\\ & \text{sinc}\left[(\frac{h}{\omega_s} - \tau)\omega \right]\text{sinc}(\tau \omega). \end{aligned}$$
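
Eqn. (17) factors into a spatial envelope (from $K(u)$, Eqn. (15)) and a temporal envelope ($Z_g(\omega)$, Eqn. (16)) that modulate the replicate lines. The short sketch below evaluates the temporal envelope where the replicates cross the temporal-frequency axis ($u = 0$, $\omega = n\omega_s$); the capture rate, pixel response, and hold intervals are assumed values for illustration only.

```python
import numpy as np

def Z_g(w, h, w_s, tau):
    """Temporal envelope of Eqn. (16); note np.sinc(x) = sin(pi*x)/(pi*x)."""
    return np.sinc((h / w_s - tau) * w) * np.sinc(tau * w)

w_s = 120.0                         # capture rate, Hz (assumed)
tau = 2e-4                          # pixel response, s (assumed)
for h in (0.05, 0.5, 1.0):          # near-stroboscopic ... sample and hold
    amps = [abs(Z_g(n * w_s, h, w_s, tau)) for n in (1, 2, 3)]
    print(f"hold interval {h}: |Z_g| at the first three replicates =", np.round(amps, 3))
# Larger hold intervals attenuate the replicates along this u = 0 slice, but most
# of each replicate lies at non-zero spatial frequency; see Section 4.1 for why
# this has limited effect on judder.
```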

Thus far all frames are treated as black and white. The case for $RGB$ frames is similar, but with a few modifications. Each frame has three channels, and each channel has its own contrast distribution function in the same format but with different shifts in space and time. For the contrast distribution of each channel, the fill factor $f$ in $c_z$ in the equations below should be replaced with $f_{RGB} = f/3$. In the simultaneous presentation mode of RGB colors:

$$c_{simul}(x, t, i) = c_z(x - \frac{ip}{3}, t),$$
where $i$ is the index of color channels ($i = 0, 1, 2$ for red, green, and blue channels, respectively). Therefore, the frequency spectrum has an additional phase shift:
$$C_{simul}(u,\omega,i) = C_{z}(u,\omega) \text{exp}({-}j\frac{2\pi ip}{3}u).$$

In the sequential $RGB$ mode, the frame period $\Delta t$ in the staircase function $z_g(t)$ is replaced with $\Delta t/3$, and the sampling frequency $\omega _s$ in its Fourier transform $Z_g(\omega )$ is replaced with $3\omega _s$. There is an additional shift in time:

$$c_{seq}(x, t, i) = c_z(x - \frac{ip}{3}, t - \frac{i\Delta t}{3})$$

The resulting frequency spectrum is:

$$C_{seq}(u,\omega,i) = C_{z}(u,\omega) \text{exp} \left[{-}j\frac{2\pi i}{3}(pu + \frac{\omega}{\omega_s}) \right].$$

Next, we consider the situation in which the viewer makes a smooth eye movement to track the stimulus. When tracking occurs, all captured frames produce the same contrast distribution on the retina at the frame onset. However, when the hold interval $h$ is greater than 0, the eye’s motion introduces opposite motion on the retina during a frame. Therefore:

$$c_{eye} (x,t) = \left[ ker(x)\ast \Delta t \delta(x+rt) z_g(t) \right] \ast \sum_{n ={-}\infty}^{\infty} \delta(t - n \Delta t).$$

The trajectory is continuous during frame intervals given the smooth motion of the eye. The spectrum of the stimulus is:

$$\begin{aligned}C_{eye} (u, \omega) & = \left[ K(u) \delta(\omega - ru) \right] \ast Z_g(\omega) \sum_{n ={-}\infty}^{\infty} \delta(\omega - n\omega_s)\\ & = K(u) Z_g(\omega - ru) \sum_{n ={-}\infty}^{\infty} \delta(\omega - n\omega_s)\\ & = K(u) \sum_{n ={-}\infty}^{\infty} Z_g(n\omega_s - ru)\,\delta(\omega - n\omega_s) \end{aligned}$$

Similarly, the representation of the $RGB$ stimulus incorporates the additional shifts and a different pixel fill factor. For the contrast distribution of each channel, the fill factor $f$ is now $f_{RGB} = f/3$. In the simultaneous presentation mode of $RGB$ colors:

$$c_{eye simul}(x, t, i) = c_{eye}(x - \frac{ip}{3}, t)$$
where $i$ is the index of color channels which gives rise to a phase shift in the spectrum:
$$C_{eye simul}(u,\omega,i) = C_{eye}(u,\omega) \text{exp}({-}j\frac{2\pi ip}{3}u).$$

The sequential $RGB$ mode is more complicated with object tracking. The additional time shift is accompanied by spatial offsets for the different colors, and the staircase widths along both the temporal ($\Delta t$ in $z_g(t)$) and spatial axes are reduced by a factor of 3:

$$\begin{aligned}c_{eyeseq}(x, t, i) & = \left[ ker(x)\ast \Delta t \delta(x+rt ) \right]\\ & z_g(t)\frac{3}{rh\Delta t} \text{rect} \left[ \frac{3x - irh\Delta t}{rh\Delta t} \right]\\ & \ast \left[\sum_{n ={-}\infty}^{\infty} \delta(t - n \Delta t) \right] \ast \delta (x - \frac{ip}{3}) \end{aligned}$$

The resulting frequency spectrum is:

$$\begin{aligned}C_{eyeseq}(u,\omega,i) & = K(u)Z_g(\omega - ru)\\ & \ast \left[ \text{sinc}(\frac{rh}{3\omega_s}u) \text{exp}({-}j\frac{2\pi rh}{3\omega_s}u)\right]\\ & \text{exp}({-}j\frac{2\pi ip}{3}u) \sum_{n ={-}\infty}^{\infty} \delta(\omega - n \omega _s) \end{aligned}$$
where the $\omega _s$ in $Z_g(\omega )$ is substituted by $3\omega _s$.

3.5 CSF model

A binary rectangular window, i.e., the "window of visibility" proposed by Watson [1], is an oversimplification of the frequency response of the visual system. Other factors that affect sensitivity, such as luminance, should be considered as well. The "pyramid of visibility" is an upgraded model, also proposed by Watson and colleagues, in which high-frequency sensitivity was fit with linear models (in log space). The pyramid defines the CSF as a 3D surface with a height that increases with luminance (Fig. 3), and is only valid for high-frequency bands [8]. The stelaCSF model, in contrast, is more comprehensive but more complex, with five dimensions [12]. In our CSF model, we incorporated three dimensions: spatial frequency, temporal frequency, and luminance. Previous work has shown that the CSF is approximately separable at high frequencies, so we assumed separability to simplify computation. Our CSF model is:

$$S(u, \omega) = \sqrt{S_s (u) S_t (\omega)},$$
where $S_s(u)$ is the spatial CSF and $S_t(\omega )$ is the temporal CSF. The parameters of the spatial model were determined from previous work [24]:
$$S_s(u) = \frac{5200e^{{-}0.0016u^2(1 + 100/L)^{0.08}}}{\sqrt{(1 + \frac{144}{X_o^2} + 0.64u^2)(\frac{63}{L^{0.83}} + \frac{1}{1-e^{{-}0.02u^2}} )}},$$
where $X_o$ is the angular object size (deg), $L$ is luminance (cd/m$^2$), and $u$ is spatial frequency (cpd). We developed a temporal CSF model incorporating luminance by fitting an empirical function to a set of published psychophysical data [25,26] (Fig. 3(A)):
$$S_t(\omega) = \frac{5360L^{2.51}e^{{-}0.16\omega^{L^{{-}0.017}}}}{\sqrt{(\frac{a\omega^{bL}}{L^{{-}4.98}} + 1)(\frac{c}{L} + \frac{d}{1.007-e^{f\omega^{3.8}}})-L^5}},$$
where $a = 2.1\times 10^{9}$, $b = 9\times 10^{-4}$, $c = 1.2\times 10^{-7}$, and $d = -2.7\times 10^{-4}$ (adjusted $R^2$: 0.98). The original data were in terms of retinal illuminance (trolands) rather than luminance (cd/m$^2$), so we converted using an existing formula [7,24].
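
A direct transcription of the spatial model (Eqn. (29)) and the separable combination (Eqn. (28)) is sketched below. One constant of the fitted temporal model in Eqn. (30) (the exponent $f$ inside the exponential) is not listed with the others in the text, so the sketch substitutes a clearly labeled placeholder temporal CSF to keep it runnable; treat this as an illustration, not the toolbox's code.

```python
import numpy as np

def S_s(u, L, X_o):
    """Spatial CSF of Eqn. (29). u: spatial frequency (cpd), L: luminance
    (cd/m^2), X_o: angular object size (deg)."""
    num = 5200 * np.exp(-0.0016 * u**2 * (1 + 100 / L) ** 0.08)
    den = np.sqrt((1 + 144 / X_o**2 + 0.64 * u**2)
                  * (63 / L**0.83 + 1 / (1 - np.exp(-0.02 * u**2))))
    return num / den

def S_t_placeholder(w, w_c=30.0, s_peak=200.0):
    """Stand-in temporal CSF (simple exponential low-pass). Eqn. (30) is the
    paper's fitted model; substitute it here once all of its constants are
    available."""
    return s_peak * np.exp(-w / w_c)

def S(u, w, L, X_o, S_t=S_t_placeholder):
    """Separable joint CSF of Eqn. (28): geometric mean of the two factors."""
    return np.sqrt(S_s(u, L, X_o) * S_t(w))

# Example: sensitivity surface on a small frequency grid at 160 cd/m^2.
u = np.linspace(0.5, 60.0, 120)      # cpd (u > 0 to avoid the singularity at u = 0)
w = np.linspace(0.5, 90.0, 180)      # Hz
U, W = np.meshgrid(u, w, indexing="ij")
sensitivity = S(U, W, L=160.0, X_o=5.5)
print(sensitivity.shape, float(sensitivity.max()))
```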


Fig. 3. Model of human contrast sensitivity function (CSF). (A) Empirical temporal CSF model (solid lines) obtained from fitting experimental data (round markers), plotted as a function of temporal frequency for different retinal illuminances (in trolands). (B) Dependence of spatiotemporal CSF on luminance. Log contrast sensitivity is plotted as a function of temporal ($\omega$) and spatial frequency ($u$) at two luminances: 0.5 cd/m$^2$ (inner cone) and 160 cd/m$^2$ (outer cone).


From Eqns. (28)–(30), we obtained our spatiotemporal CSF model. As expected, peak contrast sensitivity grows with increasing luminance. For example, it increases by a factor of 4 as luminance increases from $0.5$ to $160\mathrm {cd/m^2}$ (Fig. 3(B)). The specific CSF profile for the configured stimulus is determined based on the average luminance across the viewer’s fovea:

$$L_{mean} = \frac{L_{max} X D_{fovea} + L_{min} (A_{fovea} - X D_{fovea})}{A_{fovea}}$$
where $D_{fovea} = 5.5$ deg is the angular diameter of the fovea [27], and $A_{fovea}$ is the area of the fovea calculated as:
$$A_{fovea} = \frac{\pi D_{fovea}^2 }{4}$$

The predicted stimulus based on the input’s Weber contrast is then determined using the CSF model mentioned above.
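
A quick numerical instance of Eqns. (31) and (32), with illustrative values:

```python
import numpy as np

# Mean luminance across the fovea (Eqns. 31 and 32): a bright strip of angular
# width X crossing a fovea of diameter 5.5 deg. Values below are illustrative.
D_fovea = 5.5                              # deg
A_fovea = np.pi * D_fovea**2 / 4           # deg^2
L_max, X, contrast = 150.0, 0.1, 50.0      # cd/m^2, deg, Weber contrast
L_min = L_max / (1 + contrast)
L_mean = (L_max * X * D_fovea + L_min * (A_fovea - X * D_fovea)) / A_fovea
print(round(L_mean, 2))                    # a few cd/m^2 for a thin bright line
```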

3.6 Depth distortion in stereoscopic displays

In the stereoscopic mode, the user inputs parameters involved in calculating disparity. Binocular disparity is the position of a feature in the right eye relative to the same feature in the left eye. Specifically,

$$d = x_r - x_l,$$
where $x_r$ and $x_l$ are the horizontal coordinates of the right- and left-eye images, respectively. When $d$ is positive, the disparity is uncrossed, which means that the feature should appear farther than the display screen.

The disparity error for a variety of stimulus speeds, capture rates, and flash numbers is:

$$e = \left({\frac{v}{r}}\right)\left(\frac{1}{2f}\right),$$
where $e$ is the error in degrees, $v$ is horizontal speed in °/s, $r$ is capture rate in Hz, and $f$ is flash number. When $e$ is positive, the object has unintended uncrossed disparity and should appear farther than desired.
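
Eqn. (34) is simple to tabulate. The sketch below uses the speed and capture rate quoted in Figs. 8 and 9 (11.31 °/s, 60 Hz) and reproduces both the error discussed in Section 4.4 and the factor-of-three reduction from triple flash.

```python
def disparity_error_deg(v_deg_per_s, capture_rate_hz, flashes=1):
    """Disparity error of Eqn. (34) for alternating presentation of
    simultaneously captured images."""
    return (v_deg_per_s / capture_rate_hz) / (2 * flashes)

v, r = 11.31, 60.0                     # deg/s and Hz, as in Figs. 8 and 9
for flashes in (1, 2, 3):
    e_deg = disparity_error_deg(v, r, flashes)
    print(f"{flashes}x flash: {e_deg * 60:.2f} arcmin")
# Single flash gives ~5.65 arcmin, matching the example in Section 4.4;
# triple flash reduces the error by a factor of 3.
```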

The right panel in Fig. 2 shows the user inputs.

Parameters in binocular disparity calculation:

  • Capture mode: Simultaneous or alternating left- and right-camera capture. Simultaneous capture is the default because it is more common.
  • Presentation mode: Simultaneous or alternating left- and right-eye presentation. Alternating presentation is the protocol for temporal-interlaced stereoscopic displays, which are fairly common. Alternating is the default.

4. Task-based functionality of BiPMAP

4.1 Judder

From Eqns. (1)–(6), increasing speed produces counterclockwise rotation in the space-time domain and horizontal shear in the frequency domain. Figure 4 shows this for speeds increasing from 1 to 10 cm/s. The slope in space-time is steeper for the faster speed (Fig. 4, left column), and shallower in the Fourier domain (Fig. 4, second column). The third and fourth columns show, respectively, the spectra after filtering by the CSF, and the perceived stimuli after filtering. If we assume a capture rate of 120Hz, only one component of the frequency spectrum for the slower sampled stimulus is passed by the CSF (Fig. 4, second row, filtered spectrum), and that component is the same as the one in the continuous stimulus. In other words, the output frequency spectra of the continuous and sampled stimuli are identical (Fig. 4, first and second rows, filtered spectrum), which means that the sampled stimulus should be perceived as moving smoothly (Fig. 4, first and second rows, Reconstruction). In contrast, the faster stimulus produces more than three replicates that fall within the window of visibility (third and fourth rows). As a result, the reconstructed stimulus has gaps, which means that the perceived motion will be discontinuous: i.e., judder will occur. Note that the filtered spectra for the slower and faster stimuli only intersect the temporal frequency axis at the origin, which means that there will be no visible flicker despite the presence of judder.


Fig. 4. Effect of stimulus speed on judder. Stimulus column: The stimulus is a bright line whose position is plotted as a function of time. The user specifies line width and speed. Input spectrum column: 2D discrete Fourier transform of the stimulus. Spatial frequency and temporal frequency are plotted on the ordinate and abscissa, respectively. Brightness represents amplitude. Output spectrum column: Spectrum filtered by the CSF. Reconstruction column: Perceived stimulus reconstructed using inverse Fourier transform of the output spectrum. Top two rows: Results for continuous (first row) and sampled stimulus (second row) with a speed of 1cm/s (1.15°/s). No motion artifacts perceived from the sampled stimulus. Bottom two rows: Results for continuous (third row) and sampled stimulus (fourth row) with a speed of 10cm/s (11.42°/s). Pixel density: 300dpi. Capture rate: 120Hz. Hold interval: 0.5. Viewing distance: 50cm.


The occurrence of judder and not flicker is also revealed in Fig. 5. In this set of examples, the speed is 1cm/s while the capture rate increases from 30Hz (Fig. 5, second row) to 120Hz (bottom row). Greater capture rates—i.e., sampling rates—push the replicates in the sampled frequency spectra farther from one another until only one remains inside the window of visibility. Thus at sufficiently high capture rates, the reconstructed stimulus (Fig. 5, bottom row) should appear to move smoothly (top row).


Fig. 5. Effect of capture rate on judder. Columns in same format as Fig. 4. Top row: Results for continuous stimulus. Second row: Output from sampled stimulus with $30$Hz capture rate. Third row: Same but with capture rate of $60$Hz. Fourth row: Same but with capture rate of $120$Hz. Stimulus speed: 1cm/s (1.15°/s). Pixel density: 300dpi. Hold interval: 0.5. Viewing distance: 50cm.


When the hold interval is increased from a small value ($\sim$0) to a large one ($\sim$1), the temporal sinc function attenuates higher frequencies in the replicates, but that attenuation occurs mostly outside the window of visibility. Thus increasing the hold interval does little to suppress judder.

4.2 Motion blur

When viewers track a moving continuous stimulus, the velocity on the retina is zero, but because the brain measures the eye motion through extra-retinal signals, the stimulus will appear to be moving and doing so smoothly. When viewers track a sampled stimulus of the same velocity, they perceive the motion but also often experience motion blur [14]. The hold interval is crucial here. When the hold interval is large, blur is experienced because the static image of the stimulus in one frame is smeared across the retina as the eye keeps moving [28]. This is illustrated by Fig. 6, bottom row. Notice the diagonal lines in the left panel, which show the smearing across the retina during the hold interval. The limited spatiotemporal bandwidth of the visual system acts like a low-pass filter and causes a sharp contour to appear blurred. Reducing the hold interval causes less retinal smear, so motion blur is minimized or even eliminated (Fig. 6, second row).
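
As a rough sanity check on this mechanism (a back-of-the-envelope estimate, not a formula given in the paper): during one frame the eye travels while the held image stays fixed on the display, so the retinal smear per frame is approximately the angular speed multiplied by the hold duration.

```python
# Approximate retinal smear per frame during tracking (illustrative estimate).
v_deg_per_s = 22.62          # stimulus speed, deg/s (as in Fig. 6)
capture_rate = 120.0         # Hz
for hold in (0.1, 1.0):      # the two hold intervals compared in Fig. 6
    smear_deg = v_deg_per_s * hold / capture_rate
    print(f"hold = {hold}: ~{smear_deg * 60:.1f} arcmin of smear per frame")
# hold = 1 gives ~11.3 arcmin of smear, which the visual system's low-pass
# filtering renders as visible blur; hold = 0.1 gives only ~1.1 arcmin.
```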


Fig. 6. Effects of hold interval and tracking eye movement on motion blur. Columns in same format as Figs. 4 and 5. Stimulus position and spatial frequency are in retinal coordinates. Viewer makes a tracking eye movement to fixate the moving stimulus. The continuous stimulus (row 1) is therefore stationary on the retina. Hold interval in the sampled stimulus is 0.1 and 1 in the second and third rows, respectively. Motion blur occurs with large but not small hold interval. Stimulus velocity on display: 20cm/s (22.62°/s). Pixel density: 300dpi. Capture rate: 120Hz. Viewing distance: 50cm.


4.3 Color breakup in field-sequential RGB displays

Color breakup is another artifact that affects the viewing experience. It occurs when the display presents colors sequentially as with most DLP projectors [29]. Breakup is most noticeable when viewers track the displayed stimulus with smooth eye movements (Fig. 7, middle row), but it can also be seen when the eye is stationary and the stimulus moves past [15]. When red, green, and blue are presented sequentially within a frame and the viewer tracks the stimulus, each color will fall on a slightly different part of the retina. As a result, a white line on a black background will be imaged on the retina as three spatially separated lines: red, green, and blue; red will appear farther in the direction of motion than green, which will appear farther than blue. Thus, color breakup is an appearance of spatial color fringing caused by offsets in time. One can apply spatial offsets to eliminate the artifact [19]. The spatial offset required for each color channel is:

$$\textit{offset}_i = (i - 1)\frac{v}{3r},$$
where $i$ is the presentation order for the color channel that is being corrected, ranging from $1$ to $3$, $v$ is stimulus speed, and $r$ is the capture rate. The effect of adding such offsets is shown in Fig. 7. When we add spatial offsets to green and blue, color breakup is successfully eliminated (Fig. 7, bottom row).
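
Eqn. (35) in code (a small illustration; whether each offset is applied with or against the motion direction follows the verbal description above and should be treated as an assumption):

```python
def color_offsets_deg(v_deg_per_s, capture_rate_hz):
    """Spatial offset for each channel in presentation order i = 1, 2, 3 (Eqn. 35)."""
    return [(i - 1) * v_deg_per_s / (3 * capture_rate_hz) for i in (1, 2, 3)]

# Example matching Fig. 7: 22.62 deg/s tracked stimulus, 120 Hz capture rate.
offsets = color_offsets_deg(22.62, 120.0)
print([round(o * 60, 2) for o in offsets])    # arcmin: [0.0, 3.77, 7.54]
```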


Fig. 7. Color breakup with tracking eye movement. Left column: Stimulus moving at 20cm/sec (22.62°/s) with tracking eye movement. Position on the retina is plotted as a function of time. Right column: Reconstructed stimuli. Top row: Continuous stimulus with red, green, and blue presented simultaneously. Middle row: Sampled stimulus presented with color-sequential display. Bottom row: Sampled stimulus with spatial offsets to counter color breakup. Stimulus speed: 20cm/sec (22.62°/s). Pixel density: 300dpi. Capture rate: 120Hz. RGB mode: sequential. Hold interval: 1. Viewing distance: 50cm.


These predictions from BiPMAP are confirmed by experimental data from Johnson and colleagues [19]. They showed that when one applies the spatial offsets specified by Eqn. (35), color breakup is completely eliminated when the viewer tracks a moving object and is greatly diminished when the viewer does not track a moving object.

4.4 Depth distortion in stereoscopic displays

When filming or generating content for stereoscopic displays, the left and right images are usually captured simultaneously. Many stereoscopic displays then present the left and right images in temporally alternating fashion to the two eyes [30]. As a consequence, horizontally moving objects can appear to be displaced in depth; this is a depth distortion [31]. To understand the cause of this effect, we must look into how the brain estimates binocular disparity. Consider a situation in which the disparity of a moving object is intended to be zero (meaning the left- and right-eye images are the same and the object is meant to appear at the depth of the display screen) and the left- and right-eye images are presented in alternation in a given frame. The brain must pair the left and right images in order to compute the binocular disparity. The problem for a given left-eye image is whether to pair it with the following right-eye image or the preceding one. If the left-eye image occurs in the first half of the frame, the pairing with the following image is correct and the pairing with the preceding one is incorrect. But the brain has no way to know which pairing is the correct one, so it matches using an average of the two [23,32]. Specifically, the left-eye image is paired with both the preceding and following right-eye images and the disparity is derived from the average of those two pairings. The estimated disparity is therefore incorrect, which produces an apparent displacement of the object in depth. BiPMAP allows the user to determine what the estimated disparity is likely to be for different stimuli and display protocols.

Consider an object moving from left to right with zero disparity (i.e., it should be seen in the plane of the display). When the left-eye image is presented before the right-eye image in each frame, the estimated disparity will be the average of 0 (pairing with the following image) and -${\Delta }x$ (pairing with the preceding image). So instead of obtaining the correct estimate of zero disparity, the brain obtains an incorrect estimate of -${\Delta }x/2$, which in the example shown in the left column of Fig. 8 produces an average disparity error of 5.65 arcmin. This error is readily visible because it is more than an order of magnitude greater than the disparity threshold. The error can be derived from Eqn. (34) by substituting $-{\Delta }x/{\Delta }t$ for $v$, $1/{\Delta }t$ for $r$, and $1$ for $f$:

$$e = \left(\frac{-{\Delta}x}{{\Delta}t}\right)\left(\frac{{\Delta}t}{2}\right) ={-}\frac{{\Delta}x}{2}.$$

Because of this, the object’s perceived distance is nearer than it should be: i.e., a depth distortion. If the motion were right to left, the object would appear farther than it should. These predictions of BiPMAP have been confirmed experimentally by Hoffman and colleagues using the same model [3]. They showed that when simultaneously captured images are displayed alternately to the two eyes, the depth distortion predicted by BiPMAP is observed. And they showed that the distortion can be eliminated by inserting a nulling disparity that has the opposite sign from the disparity distortion predicted by BiPMAP.


Fig. 8. Estimated disparity with different display protocols. Labels for each column indicate that capture ($C$) is simultaneous and presentation ($P$) is sequential. The number followed by $X$ is the flash number. Top row: Simultaneous capture and sequential presentation with different flash numbers for a stimulus with a nominal disparity of zero moving left to right. Position on the display is plotted as a function of time. Red lines indicate the left-eye presentation and blue lines the right-eye presentation. The first, second, and third panels show the displayed stimulus for single-, double-, and triple-flash, respectively. Bottom row: Estimated disparity for these protocols. Estimated disparity is plotted as a function of time. Green dots represent the disparities associated with left- and right-eye pairings in which right eye leads or lags the left eye. Dashed line indicates the estimated disparity. Stimulus speed: $10$cm/s (11.31$^{\circ }$/s). Capture rate: $60$Hz. Viewing distance: 50cm.


This depth distortion can be eliminated, of course, by presenting the images to the two eyes simultaneously or by capturing the left- and right-eye image data sequentially at the same alternation rate as will be used in the presentation. This BiPMAP prediction has been confirmed experimentally [3].

Field-sequential presentation is commonplace in stereoscopic cinema and is usually accompanied by multi-flash. For example, in the RealD protocol each pair of images is presented three times (’triple flash’) before updating to a new pair of images [33]. This reduces the magnitude of the expected depth distortion by a factor of 3 as shown by Eqn. (34) and the right panel of Fig. 8. The expected effect with double flash is smaller and illustrated by the middle panel of the figure.

When the viewer tracks the horizontally moving object with a smooth eye movement, the depth distortion created by alternating presentation is the same as that without the eye movement. This can be shown by running BiPMAP in stereo mode with eye tracking enabled (Fig. 9). This prediction has been confirmed experimentally [3].


Fig. 9. Effect of eye motion on estimated disparity. Left panel: Stimulus (top) and estimated disparity (bottom) with no tracking eye movement. Right panel: Stimulus (top) and estimated disparity (bottom) with tracking eye movement. Green dots represent the disparities associated with left- and right-eye pairings in which right eye leads or lags the left eye. Dashed line is the estimated disparity. The estimated disparity is the same with and without eye tracking. Stimulus speed: $10$cm/s (11.31$^{\circ }$/s). Capture rate: $60$Hz. Viewing distance: 50cm.


5. Conclusions and future direction

We developed a toolbox—BiPMAP—for predicting and visualizing motion artifacts that are often seen when viewing digital displays. The toolbox enables users to input parameters including the stimulus configuration, the properties of the display, and viewing parameters. They can then determine whether motion artifacts will be seen and, if they are, which ones will be seen. By adjusting input parameters, the user can determine how best to minimize or eliminate the artifacts. We hope this tool will aid the development of future displays.

BiPMAP could be further extended to include other parameters such as retinal eccentricity, chromatic vs. luminance variation, and refractive error [12,24,34,35]. The model of the visual system could incorporate additional nonlinear mechanisms, more sophisticated chromatic models, and more sophisticated eye-movement models [36]. In addition, more complicated stimuli could be added at the front end, e.g., natural video. Finally, image-quality metrics could be included so that the user can obtain a quantitative estimate of the quality of the display being prototyped [37–39]. Current predictions of judder and blur could be calibrated more extensively via further experiments. We intend to add these features in future versions of the toolbox.

Funding

Center for Innovation in Vision and Optics.

Acknowledgement

G. Meng was funded by the Center for Innovation in Vision and Optics.

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are available in Ref. [26].

References

1. A. B. Watson, A. J. Ahumada, and J. E. Farrell, “Window of visibility: a psychophysical theory of fidelity in time-sampled visual motion displays,” J. Opt. Soc. Am. A 3(3), 300–307 (1986). [CrossRef]  

2. A. B. Watson, “High frame rates and human vision: A view through the window of visibility,” SMPTE Mot. Imag. J 122(2), 18–32 (2013). [CrossRef]  

3. D. M. Hoffman, V. I. Karasev, and M. S. Banks, “Temporal presentation protocols in stereoscopic displays: Flicker visibility, perceived motion, and perceived depth,” J. Soc. Inf. Disp. 19(3), 271–297 (2011). [CrossRef]  

4. P. V. Johnson, J. Kim, D. M. Hoffman, et al., “Motion artifacts on 240-Hz OLED stereoscopic 3D displays,” J. Soc. Inf. Disp. 22(8), 393–403 (2014). [CrossRef]

5. S. Daly, N. Xu, J. Crenshaw, et al., “A psychophysical study exploring judder using fundamental signals and complex imagery,” SMPTE Mot. Imag. J 124(7), 62–70 (2015). [CrossRef]  

6. G. Denes, A. Jindal, A. Mikhailiuk, et al., “A perceptual model of motion quality for rendering with adaptive refresh-rate and resolution,” ACM Trans. Graph. 39(4), 133 (2020). [CrossRef]  

7. P. G. J. Barten, Contrast Sensitivity of the Human Eye and Its Effects on Image Quality (SPIE, 1999).

8. A. Watson and A. Ahumada, “The pyramid of visibility,” J. Vis. 16(12), 567 (2016). [CrossRef]

9. A. Chapiro, R. Atkins, and S. Daly, “A luminance-aware model of judder perception,” ACM Trans. Graph. 38(5), 1–10 (2019). [CrossRef]  

10. A. Jindal, K. Wolski, K. Myszkowski, et al., “Perceptual model for adaptive local shading and refresh rate,” ACM Trans. Graph. 40(6), 1–18 (2021). [CrossRef]  

11. R. K. Mantiuk, G. Denes, A. Chapiro, et al., “FovVideoVDP: A visible difference predictor for wide field-of-view video,” ACM Trans. Graph. 40(4), 1–19 (2021). [CrossRef]

12. R. K. Mantiuk, M. Ashraf, and A. Chapiro, “stelaCSF: A unified model of contrast sensitivity as the function of spatio-temporal frequency, eccentricity, luminance and area,” ACM Transactions on Graphics (2022).

13. B. Krajancich, P. Kellnhofer, and G. Wetzstein, “A perceptual model for eccentricity-dependent spatio-temporal flicker fusion and its applications to foveated graphics,” ACM Trans. Graph. 40(4), 1–11 (2021). [CrossRef]  

14. M. A. Klompenhouwer, “54.1: Comparison of LCD motion blur reduction methods using temporal impulse response and MPRT,” in SID Symposium Digest of Technical Papers, vol. 37 (Wiley Online Library, 2006), pp. 1700–1703.

15. M. Mori, T. Hatada, K. Ishikawa, et al., “Mechanism of color breakup on field-sequential color projectors,” in SID Symposium Digest of Technical Papers, vol. 30 (Wiley Online Library, 1999), pp. 350–353.

16. C.-H. Chen, F.-C. Lin, Y.-T. Hsu, et al., “A field sequential color LCD based on color fields arrangement for color breakup and flicker reduction,” J. Disp. Technol. 5(1), 34–39 (2009). [CrossRef]

17. F.-C. Lin, Y.-P. Huang, and H.-P. D. Shieh, “Color breakup reduction by 180 Hz stencil-FSC method in large-sized color filter-less LCDs,” J. Disp. Technol. 6(3), 107–112 (2010). [CrossRef]

18. Z. Qin, Y. Zhang, F.-C. Lin, et al., “A review of color breakup assessment for field sequential color display,” Inf. Disp. 35(2), 13–43 (2019). [CrossRef]  

19. P. V. Johnson, J. Kim, and M. S. Banks, “The visibility of color breakup and a means to reduce it,” J. Vis. 14(14), 10 (2014). [CrossRef]  

20. D. C. Burr and J. Ross, “How does binocular delay give information about depth?” Vision Res. 19(5), 523–532 (1979). [CrossRef]  

21. P. V. Johnson, J. Kim, and M. S. Banks, “Stereoscopic 3D display technique using spatiotemporal interlacing has improved spatial and temporal properties,” Opt. Express 23(7), 9252–9275 (2015). [CrossRef]

22. M. Morgan, “Perception of continuity in stroboscopic motion: a temporal frequency analysis,” Vision Res. 19(5), 491–500 (1979). [CrossRef]  

23. J. C. Read and B. G. Cumming, “The stroboscopic Pulfrich effect is not evidence for the joint encoding of motion and depth,” J. Vis. 5(5), 3 (2005). [CrossRef]

24. P. G. Barten, “Formula for the contrast sensitivity of the human eye,” in Image Quality and System Performance, vol. 5294 (SPIE, 2003), pp. 231–238.

25. A. B. Watson, “Temporal sensitivity,” Handbook of Perception and Human Performance 1, 1–43 (1986).

26. H. De Lange Dzn, “Research into the dynamic nature of the human fovea→cortex systems with intermittent and modulated light. I. Attenuation characteristics with white and colored light,” J. Opt. Soc. Am. 48(11), 777–784 (1958). [CrossRef]

27. A. Hendrickson, “Organization of the adult primate fovea,” in Macular Degeneration, (Springer, 2005).

28. X.-F. Feng, “LCD motion-blur analysis, perception, and reduction using synchronized backlight flashing,” in Human Vision & Electronic Imaging XI, vol. 6057 (SPIE, 2006), pp. 213–226.

29. L. J. Hornbeck, “Digital light processing for high-brightness high-resolution applications,” in Projection Displays III, vol. 3013 (SPIE, 1997), pp. 27–40.

30. M. Park, J. Kim, and H.-J. Choi, “Effect of interlacing methods of stereoscopic displays on perceived image quality,” Appl. Opt. 53(3), 520–527 (2014). [CrossRef]  

31. J. Kim, “An overview of depth distortion in stereoscopic 3D displays,” J. Inf. Disp. 16(2), 89–97 (2015). [CrossRef]

32. M. S. Banks, D. M. Hoffman, J. Kim, et al., “3D displays,” Annual Review of Vision Science (2016).

33. M. Cowan, “RealD 3D theatrical system - a technical overview,” in European Digital Cinema Forum, (2008).

34. A. B. Watson, “The field of view, the field of resolution, and the field of contrast sensitivity,” Electron. Imaging 2018(13), 1–5 (2018). [CrossRef]  

35. K. T. Mullen, “The contrast sensitivity of human colour vision to red-green and blue-yellow chromatic gratings,” The J. Physiol. 359(1), 381–400 (1985). [CrossRef]  

36. S. J. Daly, “Engineering observations from spatiovelocity and spatiotemporal visual models,” in Human Vision and Electronic Imaging III, vol. 3299, B. E. Rogowitz and T. N. Pappas, eds. (SPIE, 1998), pp. 180–191.

37. A. K. Venkataramanan, C. Wu, A. C. Bovik, et al., “A hitchhiker’s guide to structural similarity,” IEEE Access 9, 28872–28896 (2021). [CrossRef]  

38. H. Z. Nafchi, A. Shahkolaei, R. Hedjam, et al., “Mean deviation similarity index: Efficient and reliable full-reference image quality evaluator,” IEEE Access 4, 5579–5590 (2016). [CrossRef]  

39. E. Prashnani, H. Cai, Y. Mostofi, et al., “PieAPP: Perceptual image-error assessment through pairwise preference,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (2018), pp. 1808–1817.
