Optica Publishing Group

Improved texture reproduction assessment of camera-phone-based medical devices with a dead leaves target

Open Access

Abstract

Camera-phone-based medical devices (CPMDs) represent a major emerging platform for point-of-care diagnostic imaging of biological tissue. In order to evaluate degradation in texture reproduction due to visible light image processing performed by CPMDs, a method ($MT{F_{DL}}$) involving the generation of a modulation transfer function ($MTF$) based on measurements of a ‘dead leaves (DL)’ target has been proposed. In this study, we have identified discrepancies in the quantification of noise based on gray patches of the DL target as compared to the texture region. To address this issue, we have proposed an approach ($MT{F_{DL - den}}$) for accurate $MT{F_{DL}}$ calculation through the use of effectively denoised DL images. Furthermore, we demonstrate that our $MT{F_{DL - den}}$ approach provides superior robustness to simulated noise. These findings will help establish effective standardized test methods for realistic benchtop assessment of CPMD image quality.

© 2019 Optical Society of America under the terms of the OSA Open Access Publishing Agreement

1. Introduction

1.1 Performance of camera phones for medical image acquisition

The portability, cost, multi-functionality and ubiquity of camera-enabled mobile phones (called camera phones in this paper) have made them a potentially transformative tool for telemedicine applications. Specialized software applications (commonly called “apps”) for medical imaging and examination (e.g., mobile tele-dermatology) are commercially available and widely used. Multiple applications for camera-phone-based medical devices (CPMDs) are under clinical study, from melanoma detection to microsurgery [1–3]. Additionally, CPMDs incorporating specialized software and external optical components have been developed for colposcopy, otoendoscopy and retinal fundus photography. However, given the established potential for misdiagnosis due to poor image quality [3], accurate regulatory decision-making on mobile devices with significant risk potential is a clear concern [4].

The unique hardware and software features of CPMDs necessitate novel solutions to elucidate the potential for clinical errors. Camera phones lack the complex lens systems and large sensors used in digital single-lens-reflex (DSLR) cameras, in order to minimize final system weight and manufacturing cost while still achieving satisfactory performance for consumers. Hardware shortcomings are often compensated for with extensive image modifications performed by an image signal processor (ISP). After a raw image is captured on the image sensor of a typical camera phone, it undergoes image processing steps such as pixel defect correction, demosaicing, shading correction, geometric correction, color correction, tone curve adjustment, edge enhancement, noise reduction, and JPEG compression in the ISP pipeline. Consequently, images captured by a CPMD commonly contain residual noise and artifacts introduced in these processing steps. Of these steps, the noise reduction and edge enhancement modules are of particular interest, as they can skew certain image quality metrics such as the modulation transfer function ($MTF$). $MTF$ is an image quality metric [5] used to describe image sharpness, or how well a system reproduces frequency information in a target.

1.2 ‘Dead leaves’ (DL) targets and algorithms for texture reproduction assessment

While the application of image processing algorithms in camera phones can improve visual image quality, high-frequency texture details can be inaccurately identified as noise and removed by aggressive non-linear noise reduction algorithms, which can negatively affect texture reproduction. Since many medical diagnoses depend on accurate reproduction of tissue texture in an optical image, quantifying this capability of a CPMD is essential.

$MTF$ measurement methods based on a slanted-edge ($MT{F_{SE}}$) or a sine-wave ($MT{F_{SW}}$) target have been widely used to evaluate digital cameras, and are described in a recently updated international standard [5]. However, these methods might give erroneous results for CPMDs with non-linear image processing algorithms, since edges and regular patterns are often processed differently compared to more heterogeneous regions. To overcome this limitation, more complex DL targets and algorithms have been developed to assess highly processed images. Recent studies have shown that the MTF based on a DL target ($MT{F_{DL}}$) is useful in evaluating texture reproduction in camera phones and could be used in place of time-consuming subjective image quality evaluation [6,7].

A DL target (Fig. 1a) has a central texture region (also called the DL region) that consists of a series of overlapping low-contrast discs. The discs have varying radii $(r)$ distributed proportionally to ${r^{ - 3}}$ and uniformly distributed gray levels, giving an accurate stochastic occlusion model that mimics a natural scene [8]. This property makes a DL target suitable for realistic evaluation of texture reproduction under real-world conditions, including for medical images (e.g., Fig. 1b). The texture region of a DL target is surrounded by spectrally neutral patches, used for evaluating the camera’s optoelectronic conversion function (OECF).
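The stochastic construction above can be sketched in a few lines. The following Python/NumPy fragment is an illustrative sketch only (the disc count, radius bounds, and image size are our own arbitrary choices, not the commercial target's specification): radii are drawn by inverse-CDF sampling from a density proportional to ${r^{ - 3}}$, and discs with uniformly distributed gray levels are painted in sequence, later discs occluding earlier ones.

```python
import numpy as np

def dead_leaves(size=256, n_discs=2000, r_min=2.0, r_max=64.0, seed=0):
    """Render a simple dead-leaves texture: overlapping discs with radii
    drawn from a density proportional to r^-3 and uniformly distributed
    gray levels. Later discs occlude earlier ones."""
    rng = np.random.default_rng(seed)
    img = np.full((size, size), 0.5)              # mid-gray background
    yy, xx = np.mgrid[0:size, 0:size]
    # Inverse-CDF sampling for pdf(r) proportional to r^-3 on [r_min, r_max]
    u = rng.random(n_discs)
    a, b = r_min ** -2, r_max ** -2
    radii = (a - u * (a - b)) ** -0.5
    for r in radii:
        cx, cy = rng.random(2) * size             # random disc center
        gray = rng.random()                       # uniform gray level
        mask = (xx - cx) ** 2 + (yy - cy) ** 2 <= r ** 2
        img[mask] = gray                          # occlude earlier discs
    return img

img = dead_leaves()
```

Because small radii dominate the ${r^{ - 3}}$ density, the rendered texture contains detail at all scales, which is what makes the model scale-invariant like natural scenes.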

Fig. 1. Similarity between (a) a DL chart and (b) a skin image with random texture.

Various approaches have been proposed to measure $MT{F_{DL}}$ using the ideal power spectral density ($PSD$) of a DL target ($PS{D_{target}}$), which can be analytically modeled. Cao et al. [6] initially proposed to evaluate the $MT{F_{DL}}$ of an imaging system as in Eq. (1),

$$MT{F_{DL - Cao}} = \sqrt {PS{D_{image}}\; /\; PS{D_{target}}} $$
where $PS{D_{image}}$ is measured from the texture region of the target image. At high spatial frequencies, $PS{D_{image}}$ would be dominated by high frequency noise and artifacts (e.g., sensor noise and JPEG artifacts), causing an artificial increase in its value. McElvain et al. [7] refined this by including a $PS{D_{noise}}$ correction term in the $MT{F_{DL}}$ algorithm as in Eq. (2).
$$MT{F_{DL - gray}} = \sqrt {(PS{D_{image}} - PS{D_{noise - gray}})\; /\; PS{D_{target}}} $$
Ideally, $PS{D_{noise}}$ should be calculated from the texture region (denoted $PS{D_{noise - tex}}$). Since $PS{D_{noise - tex}}$ is difficult to obtain, it has been approximated in Eq. (2) by the $PS{D_{noise}}$ of a uniform gray patch with 50% reflectance (denoted $PS{D_{noise - gray}}$) on the DL chart. The drawback in this approximation is that $PS{D_{noise - gray}}$ might not accurately estimate $PS{D_{noise - tex}}$ for images captured under certain conditions. Under the reasonable assumption that a camera ISP treats different regions of the scene differently, a homogeneous gray region in a DL target image may not be processed in the same manner as the texture region. While denoising the homogeneous gray region can be performed using a simple filter, such a technique would negatively affect the reconstruction of textured regions in the image [9]. This makes it important to verify the assumption that values of $PS{D_{noise - tex}}$ and $PS{D_{noise - gray}}$ are similar. Using $PS{D_{noise - gray}}$ to replace $PS{D_{noise - tex}}$ might lead to an underestimation of the noise present.

A recently proposed full-reference $MT{F_{DL}}$ method aims to overcome this limitation by using both the amplitude and phase information of the optical transfer function (OTF) of the reference and captured image data [10,11]. Kirk et al. provide examples of devices that show unexpected behavior in their $MT{F_{DL - gray}}$ at high ISO levels: the $MT{F_{DL - gray}}$ exhibited high contrast values for images taken at high ISO, in contradiction to their low visual quality [10]. This full-reference method requires a reference image (an ideal image that faithfully represents the target), followed by rigorous registration of the captured and reference image data. Its shortcomings include that the digital reference data are often not available to end users, and that the method is vulnerable to geometric distortion while remaining insensitive to high levels of residual noise. The full-reference $MT{F_{DL}}$ method is not the focus of this paper.

1.3 Scope of this study

In this paper, we reviewed generative models of imaging and denoising methodologies, and identified the most effective denoising techniques and their related parameters. We compared $PS{D_{noise - gray}}$ and $PS{D_{noise - tex}}$ values to evaluate the accuracy of the previously published $MT{F_{DL - gray}}$ approach. Subsequently, we proposed an alternative noise-robust semi-reference approach, called the $MT{F_{DL - den}}$ approach, that can be used to evaluate $MT{F_{DL}}$ based on continuously captured images. The $MT{F_{DL - den}}$ approach provides better noise correction than the $MT{F_{DL - gray}}$ approach, which causes the discrepancies in $MT{F_{DL}}$ reported previously [10].

2. Generative models of images and denoising methodologies

2.1 Generative model of raw images

When a device is used to image a real-world target, the captured image may not exactly reproduce the target, due to limitations of the imaging device in terms of optical component and sensor capabilities. Traditionally, such degradation is modeled by two separate factors, distortion and noise, according to the following mathematical equation

$$y({i,j} )= ({h\ast x} )({i,j} )+ n({i,j} )$$
where $x \in \; {R^{m \times n}}$ is the real-world target and $y \in \; {R^{m \times n}}$ is the digital image. Distortion and noise are denoted as $h \in {R^{k \times k}}$ and $n \in {R^{m \times n}}$ respectively, and the operation ‘*’ between h and x represents convolution. Therefore, this model assumes the digital image is a filtered version of the target with added noise.

Due to limitations in optical component and sensor capabilities, an imaging device cannot fully capture all the details of a given target and therefore causes distortion. In signal processing, such distortion is modeled by a low-pass filter h, which suppresses the strength of the high-frequency components, such as fine texture, of the original object. The Fourier transform of h, known as the frequency response or $MTF$ (i.e., the square root of the power spectral density (PSD)), is assumed to be an intrinsic property of an imaging device. If the noise term n is accurately modeled, $MTF$ can then be measured using spectral analysis.

Noise in a digital image is unwanted intensity variations caused either by optical elements (e.g., imperfect surfaces) or the image sensor (e.g., dead/stuck/hot pixels, thermal noise). Noise caused by optical elements can be corrected by flat frames and is not the focus of this study. Temporal noise and fixed pattern noise (FPN) are two common noise types caused by the image sensor, and generally observed in captured images [12]. FPN shows a temporally-constant spatial non-uniformity, which can be eliminated with the help of a dark frame (for dark signal non-uniformity) or offset and gain for each pixel (for photo-response non-uniformity). On the other hand, temporal noise (e.g., shot noise and thermal noise) randomly fluctuates over time over the whole image and is the main noise source for most digital images. In the simplest setting, the noisy fluctuations are (1) independent and identical across the imaging plane and (2) statistically independent of the target x, such as additive white Gaussian noise (AWGN) [13]. An important property of AWGN is that its PSD is spectrally flat (i.e., same power across all frequencies). Under the assumptions above, we can express the generative model (Eq. (3)) in the frequency domain as

$$PS{D_{image}} = {|{MTF} |^2}PS{D_{target}} + PS{D_{noise}}$$
Under favorable imaging conditions, the constant noise term $PS{D_{noise}}$ is negligible and the $MTF$ can be approximately calculated by Eq. (1). Image noise becomes more noticeable when the images are taken in low-light conditions and/or high ISO sensitivity settings. Dimly lit imaging conditions usually require high ISO number settings, which correspondingly lead to greater noise in the image. In this case, the $MTF$ can be determined by first subtracting the noise term $PS{D_{noise}}$ if we have the knowledge of its strength (Eq. (2)). Notice that both Eq. (1) and Eq. (2) rely on the validity of the assumptions of the generative model, and that the given image has not been further processed by the camera.
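Equations (1) and (2) and the flat-PSD property of AWGN can be verified numerically. The following Python/NumPy sketch is a hypothetical 1-D simulation (the Gaussian kernel, noise level, and trial count are our own choices standing in for a device MTF): random "targets" are circularly convolved with a known low-pass kernel, AWGN is added, and the uncorrected (Eq. 1) and noise-corrected (Eq. 2) MTF estimates are compared against the true response.

```python
import numpy as np

rng = np.random.default_rng(1)
n, trials, sigma = 1024, 200, 0.2
k = np.arange(-8, 9)
h = np.exp(-0.5 * (k / 2.0) ** 2)
h /= h.sum()                                 # normalized low-pass kernel h
H = np.abs(np.fft.rfft(h, n))                # true |MTF| of h

psd_img = np.zeros(n // 2 + 1)
psd_tgt = np.zeros(n // 2 + 1)
for _ in range(trials):
    x = rng.standard_normal(n)               # random "target" signal
    # circular convolution (h * x) plus AWGN, per Eq. (3)
    y = np.fft.irfft(np.fft.rfft(x) * np.fft.rfft(h, n), n) \
        + sigma * rng.standard_normal(n)
    psd_img += np.abs(np.fft.rfft(y)) ** 2 / n
    psd_tgt += np.abs(np.fft.rfft(x)) ** 2 / n
psd_img /= trials
psd_tgt /= trials

psd_noise = sigma ** 2                       # AWGN PSD is flat
mtf_raw = np.sqrt(psd_img / psd_tgt)                                 # Eq. (1)
mtf_corr = np.sqrt(np.clip(psd_img - psd_noise, 0, None) / psd_tgt)  # Eq. (2)
```

At high frequencies, where the true response is nearly zero, the noise floor inflates `mtf_raw` while `mtf_corr` stays close to `H`, which is precisely the artificial increase described above.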

2.2 Modified model for processed images

The digital image at a camera sensor typically undergoes a series of processing steps to boost its visual quality before it is available to the user. In this section, we provide a simplified mathematical model for this processing and explain why it makes the generative model in the last section invalid. We also suggest a correction for the model.

Although the processing steps in an ISP are essentially nonlinear, we can approximate them using a linear model by allowing the filter h and noise n to be dependent on the statistical properties of x, denoted as ${h_x}$ and ${n_x}$. The generative model then becomes

$$y({i,j} )= ({{h_x}\ast x} )({i,j} )+ {n_x}({i,j} )$$
This model implies that the system is locally linear and that local input patterns from x bearing similar structure are processed the same way. The first implication of this model is that the $MTF$ (frequency response of ${h_x}$) now depends on the structure of the real-world target. Therefore, to measure the $MTF$ for a natural scene, we need to pick a target which mimics the structure of the natural scene. This explains why the methods of $MT{F_{SW}}$ and $MT{F_{SE}}$ fail in practice, as a natural scene usually has more complex and random texture information than sinusoidal waves or slanted edges. The method based on a $DL$ target is more suitable, as this target has been designed to simulate natural scenes. Another implication of the model is that the noise is now correlated with the target x. Therefore, the frequency domain representation of the generative model contains an extra term of cross $PSD$ between the target and noise ($PS{D_{target - noise}}$).
$$PS{D_{image}} = {|{MTF} |^2}PS{D_{target}} + PS{D_{noise}} + 2|{MTF} |PS{D_{target - noise}}$$
Due to the additional $PS{D_{target - noise}}$ term, the noise-corrected $MTF$ equation (Eq. (2)) is no longer valid. However, we can still use Eq. (1) if we can remove the noise such that $PS{D_{noise}}$ is close to zero. This approach is elaborated in the next section.

2.3 Image denoising

2.3.1 Multiple-frame averaging for denoising

To estimate the noise map of a $DL$ target’s texture region, digital data of the $DL$ target are required. However, since such data are usually not available for a commercial target, a highly denoised target image might be used to approximate them. Because the DL target consists of high-frequency texture, conventional denoising techniques such as filtering can smooth away these texture details. Since multiple frames of the same scene obtained in a controlled imaging environment differ only in their temporal noise distributions, averaging them preserves high-frequency image detail while enhancing the signal-to-noise ratio (SNR). This multiple-frame approach improves feature localization in the texture region and reduces the noise standard deviation in the averaged image by a factor of $\sqrt N $, where N is the total number of averaged frames. Camera device stabilization (using a tripod and burst imaging mode, for example) is required to ensure that the frames are properly aligned. Since multiple frames are captured using the same device with minimal jitter between exposures and the imager is stabilized in our study, image registration is not required. In cases such as mapping frames from different sources, or mapping a frame to a reference image, the frames can be registered before averaging to improve accuracy. These multiple frames are combined using pixel-level image averaging to obtain the denoised estimate of the digital image of the DL target, ${I_{avg}}$.

Image averaging can be done either using simple image averaging (${I_{avg - s}}$) or weighted image averaging (${I_{avg - w}}$). The pixel-level simple averaging of the images can be achieved by Eq. (7).

$${I_{avg - s}} = \frac{1}{N}\mathop \sum \nolimits_{i = 1}^N {I_i}$$
where Ii is the two-dimensional pixel value matrix of the ith captured frame. The pixel-level weighted averaging of these frames can be estimated by Eq. (8) [14].
$${I_{avg - w}} = \mathop \sum \nolimits_{i = 1}^N {\alpha _i}{I_i} , {\textrm {with}}\; {\alpha _i} = \left( {\frac{1}{{\sigma_i^2}}} \right)/ \mathop \sum \nolimits_{i = 1}^N \frac{1}{{\sigma _i^2}}$$
where the weighting factors ${\alpha _i}$ are inversely proportional to the noise variance $\sigma _i^2$ calculated from the gray patch in each frame (assuming the noise level in the gray patch matches that in other regions), and $\mathop \sum \nolimits_{i = 1}^N {\alpha _i} = 1$. Weighting in inverse proportion to the noise variance ensures that higher weight is given to frames containing a lower proportion of noise relative to the signal.
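A minimal Python/NumPy sketch of Eqs. (7) and (8), assuming the frames are already aligned (the function and variable names are ours, not from the paper):

```python
import numpy as np

def average_frames(frames, noise_vars=None):
    """Pixel-level averaging of aligned frames.
    Simple mean (Eq. 7) by default; if per-frame noise variances are
    given, frames are weighted inversely to them (Eq. 8)."""
    stack = np.stack([np.asarray(f, dtype=float) for f in frames])
    if noise_vars is None:
        return stack.mean(axis=0)                     # Eq. (7)
    w = 1.0 / np.asarray(noise_vars, dtype=float)
    w /= w.sum()                                      # weights sum to 1
    return np.tensordot(w, stack, axes=1)             # Eq. (8)

rng = np.random.default_rng(0)
truth = rng.random((64, 64))                          # noiseless scene
frames = [truth + 0.1 * rng.standard_normal((64, 64)) for _ in range(10)]
avg = average_frames(frames)          # residual noise ~ 0.1 / sqrt(10)
```

When all frames share the same noise variance, Eq. (8) reduces exactly to the simple mean of Eq. (7), consistent with the observation later in this paper that the two methods gave similar results under matched capture conditions.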

2.3.2 Single-image denoising using wavelet thresholding

A single image (e.g., an averaged image) can be further denoised. It has been shown previously that for a large class of natural images, the wavelet decomposition sub-band coefficients can be modeled by a Laplacian distribution [14,15]. Image averaging followed by wavelet thresholding has been previously utilized for image denoising [14]. The wavelet transform can be used to represent a signal or an image with a high degree of sparsity, where the main signal components would be concentrated in a few coefficients, and the information contained in the rest of the coefficients would be mainly noise. Since a $DL$ target models natural images, this provides motivation for using wavelet thresholding to denoise a $DL$ target image.

Assume Y to be the matrix of wavelet coefficients of a noisy image. An estimate ($\hat{I}$) of the original image I is obtained by thresholding the wavelet coefficients Y and then performing an inverse transform. The selected threshold value (λ) determines which portion of the signal is considered noise and removed, and therefore the final quality of the denoised image.

Two commonly used methods to calculate λ are the universal threshold (U-T) and the Birge-Massart penalized threshold (B-M). Donoho and Johnstone [16] proposed the U-T method, which calculates the threshold as $\lambda = \hat{\sigma }\sqrt {2\textrm{log}({{N_{pix}}} )} $, where ${N_{pix}}$ is the number of pixels in the reference image (representing the length of the series) and $\hat{\sigma }$ is the noise level estimate, typically evaluated by the robust median estimator in the highest sub-band ($H{H_1}$) of Y according to Eq. (9) with i = 1. In this approach (U-T, single), each decomposition level is thresholded using the same value. In a second approach (U-T, multiple), the noise level and threshold value are estimated separately for each decomposition level i using the corresponding detail coefficients HHi.

$$\hat{\sigma } = \frac{{Median({|{H{H_i}} |} )}}{{0.6745}}$$
In the B-M calculation technique [17], $\lambda = |{{c_{{t^\ast }}}} |$, where ${c_i}$ are the coefficients of the image wavelet decomposition matrix ($Y$) sorted in decreasing order of absolute value and ${t^\ast }$ is the minimizer of $f(t )$, as given by Eq. (10).
$$f(t )= - \mathop \sum \nolimits_{i = 1}^t {c_i}^2 + 2{\hat{\sigma }^2}t\left( {\alpha + \log \left( {\frac{{{N_{pix}}}}{t}} \right)} \right)$$
where ${N_{pix}}$ is the number of coefficients (same as the number of pixels), t is an integer ranging from 1 to ${N_{pix}}$, $\alpha\,( > 1)$ is the sparsity parameter (a high $\alpha $ value results in a high $\lambda $ value and a sparser thresholded output), and $\hat{\sigma }$ is calculated as per Eq. (9).
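The two estimators translate directly from Eqs. (9) and (10). In this hypothetical Python/NumPy fragment (function names are ours), `sigma_mad` implements the robust median estimator and `birge_massart_threshold` minimizes the penalized cost $f(t)$ over the sorted coefficient magnitudes:

```python
import numpy as np

def sigma_mad(detail_coeffs):
    """Robust median noise estimate of Eq. (9)."""
    return np.median(np.abs(detail_coeffs)) / 0.6745

def birge_massart_threshold(coeffs, sigma, alpha=1.05):
    """Birge-Massart penalized threshold (Eq. 10): sort |c_i| in
    decreasing order, minimize the penalized cost f(t) over t,
    and return lambda = |c_{t*}|."""
    c = np.sort(np.abs(np.ravel(coeffs)))[::-1]
    n = c.size
    t = np.arange(1, n + 1)
    cost = -np.cumsum(c ** 2) + 2 * sigma ** 2 * t * (alpha + np.log(n / t))
    return c[np.argmin(cost)]

rng = np.random.default_rng(0)
noise = rng.standard_normal(10000)      # pure-noise "coefficients"
sig_hat = sigma_mad(noise)              # close to the true sigma of 1
lam = birge_massart_threshold(noise, sig_hat)
```

For pure noise the penalty dominates the cost, so the returned threshold sits near the largest coefficient and nearly everything is removed; when a few large signal coefficients are present, the minimizer extends past them and they survive thresholding.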

For a given $\lambda $ value, the thresholding can be performed either using soft-thresholding or hard-thresholding. The theoretical justifications for the performance of soft-thresholding have been studied in detail [18]. The soft-thresholded value can be calculated as in Eq. (11),

$${\eta _{\lambda ,s}}({{s_i}} )= \left\{ {\begin{array}{{c}} {{s_i} + \lambda ,\; if\; {s_i} < - \lambda }\\ {{s_i} - \lambda ,\; if\; {s_i} > \lambda }\\ {0,\; if\; |{{s_i}} |\le \lambda } \end{array}} \right.$$
where ${s_i}$ is the coefficient being thresholded in Y. The aforementioned ${c_i}$ coefficients are obtained by sorting the ${s_i}$. The function shrinks coefficients whose magnitude exceeds the threshold and sets those at or below the threshold to zero, providing a smoother transition. The hard-thresholded value, calculated as in Eq. (12), sets the coefficients below $\lambda $ to zero, while coefficients above $\lambda $ are unchanged.
$${\eta _{\lambda ,h}}({{s_i}} )= \left\{ {\begin{array}{{c}} {{s_i},\; if\; |{{s_i}} |> \lambda }\\ {0,\; if\; |{{s_i}} |\le \lambda } \end{array}} \right.$$
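Both thresholding rules of Eqs. (11) and (12) reduce to one-line array operations; the sketch below uses the equivalent signed-shrinkage form of Eq. (11):

```python
import numpy as np

def soft_threshold(s, lam):
    """Eq. (11): shrink magnitudes by lam; zero out |s| <= lam."""
    return np.sign(s) * np.maximum(np.abs(s) - lam, 0.0)

def hard_threshold(s, lam):
    """Eq. (12): keep coefficients with |s| > lam unchanged; zero the rest."""
    return np.where(np.abs(s) > lam, s, 0.0)

s = np.array([-3.0, -0.5, 0.0, 0.4, 2.0])
soft = soft_threshold(s, 1.0)    # [-2.0, 0.0, 0.0, 0.0, 1.0]
hard = hard_threshold(s, 1.0)    # [-3.0, 0.0, 0.0, 0.0, 2.0]
```

The example shows the key difference: soft thresholding pulls surviving coefficients toward zero by λ, while hard thresholding leaves them untouched, creating a discontinuity at the threshold.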

2.3.3 Evaluation of denoising methods

The peak signal-to-noise ratio (PSNR) can be used as a metric to evaluate denoising performance when the reference image is available. The PSNR of an image with noise is obtained per Eq. (13).

$$PSNR = 10\; \; lo{g_{10}}\left( {\frac{{{I_{max}}^2}}{{MSE}}} \right)$$
where ${I_{max}}$ is the maximum possible pixel value of the image, which is generally 255 for an 8-bit image. The mean squared error (MSE) between the noisy image I (e.g., ${I_i}$, ${I_{avg}}$, and ${I_{avg + wav}}$ in Fig. 4) and the reference noiseless image ${I_{ref}}$ is given by Eq. (14). The reference image and the noisy image are required to be the same size ($a \times b$ pixels) and in grayscale. A good denoising method minimizes the MSE between the reference image and the estimate.
$$MSE = \frac{1}{{ab}}\mathop \sum \limits_{i = 1}^a \mathop \sum \limits_{j = 1}^b {({I({i,j} )- {I_{ref}}({i,j} )} )^2}$$
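Eqs. (13) and (14) translate directly to code; a minimal Python sketch for 8-bit grayscale images (the function name is ours):

```python
import numpy as np

def psnr(img, ref, i_max=255.0):
    """Peak signal-to-noise ratio, Eqs. (13)-(14), for two same-size
    grayscale images; higher is better."""
    mse = np.mean((np.asarray(img, float) - np.asarray(ref, float)) ** 2)
    return 10.0 * np.log10(i_max ** 2 / mse)

ref = np.zeros((8, 8))
noisy = ref + 16.0          # constant error of 16 gray levels: MSE = 256
# PSNR = 10*log10(255^2 / 256), approximately 24.05 dB
```

Because PSNR is a monotone function of MSE, minimizing MSE and maximizing PSNR select the same denoising result.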

3. Proposed $MT{F_{DL}}$ calculation method and theoretical evaluation

In images affected by noise of considerable magnitude (e.g., under high ISO conditions), image averaging alone (the one-step procedure) may be insufficient for effective noise removal. A two-step denoising procedure, image averaging followed by wavelet thresholding, might have better denoising performance. In this section, we compare the one-step and two-step procedures in terms of their denoising performance. Noise was artificially added to a reference image at different controlled levels for evaluation.

3.1 $MT{F_{DL - den}}$ method

As mentioned in Section 2.2, Eq. (2) is no longer valid if the noise is correlated with the image contents (i.e., the noise levels are different at regions with different contents, as shown later in this paper). If we only consider the texture region for noise evaluation, Eq. (2) can still be used by replacing the $PS{D_{noise - gray}}$ with the $PS{D_{noise}}$ from the texture region ($PS{D_{noise - tex}}$), since this region has a similar content and thus a similar noise level within the whole region. Eq. (2) then becomes:

$$MT{F_{DL - tex}} = \sqrt {({PS{D_{image}} - PS{D_{noise - tex}}} )/\; PS{D_{target}}} $$
The essential idea of Eq. (15) is to reduce the effect of noise on $MT{F_{DL}}$. However, this equation is inefficient because it requires an explicit estimate of $PS{D_{noise - tex}}$. Instead, $MT{F_{DL}}$ can be calculated directly from the $PSD$ of the denoised image of the texture region, without calculating $PS{D_{noise - tex}}$, as follows
$$MT{F_{DL - den}} = \sqrt {PS{D_{image - den}}\; /\; PS{D_{target}}} $$
where $PS{D_{image - den}}$ is the image PSD calculated from the denoised texture region of the $DL$ target image. $PS{D_{target}}$ can be calculated using previously defined model equations [7], and $PS{D_{image - den}}$ is calculated by taking the radially averaged $PSD$ of the denoised image. The flowchart to calculate $MT{F_{DL - den}}$ is shown in Fig. 2.
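The radial averaging step behind $PS{D_{image - den}}$ can be sketched as follows. This is a simplified Python/NumPy illustration (windowing and the analytic $PS{D_{target}}$ model of [7] are omitted, and the white-noise patch below is only a stand-in for a denoised texture region):

```python
import numpy as np

def radial_psd(patch):
    """Radially averaged 2-D power spectral density of a grayscale patch."""
    patch = patch - patch.mean()                  # remove the DC pedestal
    F = np.fft.fftshift(np.fft.fft2(patch))
    psd2d = np.abs(F) ** 2 / patch.size
    h, w = patch.shape
    yy, xx = np.indices((h, w))
    r = np.hypot(yy - h // 2, xx - w // 2).astype(int)   # radial freq bin
    return np.bincount(r.ravel(), psd2d.ravel()) / np.bincount(r.ravel())

def mtf_dl_den(psd_image_den, psd_target):
    """Eq. (16): MTF from the denoised-image PSD and the target PSD."""
    return np.sqrt(psd_image_den / psd_target)

rng = np.random.default_rng(0)
patch = rng.standard_normal((64, 64))   # stand-in "denoised" texture patch
p = radial_psd(patch)                   # roughly flat for white noise
```

In practice `psd_target` would come from the analytic DL target model, and `psd_image_den` from the denoised captured texture region; an ideal system whose image PSD matches the target PSD yields an MTF of unity at all frequencies.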

Fig. 2. Flowchart describing the proposed MTFDL-den approach: (a) captured multiple frames, (b) one-step denoised image (${I_{avg}}$), (c) two-step denoised image (${I_{avg + wav}}$), (d) calculated PSDimage-den, and (e) MTFDL-den.

The noisy images in Fig. 2(a) were captured with a setup as shown in Fig. 3. The imaging device was positioned on a customized holder at a distance of 85 cm from the $DL$ target. The camera holder was mounted on a translational/rotational stage that could control the angle to and the distance from the target. Illumination was controlled using a lamp dimmer and two lamps (EiKO 01960 Supreme Photoflood lamps, EiKO Global, LLC), and illumination intensity was measured using an illuminance meter (T-10A, Konica Minolta, Inc.). We compared images captured on two camera phones (Apple iPhone 5S and Google Nexus 5) and a DSLR camera (Canon EOS T3i). Color images were converted to grayscale before further processing. Further processing of the images, and related calculations, were carried out in MATLAB. The wavelet denoising algorithm was developed using MATLAB’s Wavelet Toolbox. The denoised image can be obtained through the one-step or two-step approach.

Fig. 3. Experimental setup: (left) setup image, (right) schematic.

3.2 Theoretical model to establish and evaluate denoising algorithms

The most important step in the proposed $MT{F_{DL - den}}$ method is image denoising. We developed a theoretical model to determine parameters for image denoising algorithms and evaluate their performance as shown in Fig. 4. The model can accurately control the noise level in each frame so that the denoising performance can be quantitatively evaluated.

Fig. 4. Procedure for denoising performance evaluation.

A reference image (${I_{ref}}$) was first obtained by averaging many high-quality image frames of a $DL$ target captured by the Canon T3i camera, at minimal JPEG compression level. Multiple noisy frames (${I_i}$, with i = 1, 2, …, N) were generated by adding different noise distributions at specific levels to ${I_{ref}}$, to model real-world frames of the $DL$ target captured at that noise level. Section 3.2.1 describes details of noisy frame generation.

These noisy frames were averaged to obtain an averaged image (${I_{avg}}$), which is called one-step denoising. Simple averaging was used since simple and weighted averaging methods did not show significant differences from each other in our study. This may be due to the similar frame capture conditions and the inaccurate calculation of weighting factors based on the gray patch instead of the texture region. The ${I_{avg}}$ underwent further denoising using wavelet thresholding whose parameters are identified in Section 3.2.2, to generate the final denoised image ${I_{avg + wav}}$, which is called two-step denoising.

For performance evaluation, the PSNR at each step of the denoising procedure was recorded. The procedure was repeated at different noise levels to simulate different ISO settings and noise capture conditions for the camera device.

3.2.1 Generation of noisy images

The noisy frames (${I_i}$, with i = 1, 2, …, N) were generated by adding noise samples from a variety of distributions to ${I_{ref}}$, to model real-world images of the $DL$ target captured with a CMOS image sensor that has non-linear operation [19]. Photon shot noise was modeled by the Poisson distribution, assuming the incident photons during the exposure time equal the pixel values scaled to the full-well electron capacity. The photoresponse non-uniformity (PRNU) was modeled by applying a normally distributed gain map (mean (μ) of unity and standard deviation (σ) of 0.01%) to the image. Crosstalk was modeled as the proportion of a pixel intensity value that is transferred to its neighboring pixels: for a pixel centered in a 3×3 neighborhood, its value was the sum of all 9 weighted pixel values, with the weighting matrix being [0.02%, 4.92%, 0.02%; 5.1%, 78.4%, 5.1%; 0.12%, 6.2%, 0.12%]. The dark-current non-uniformity was modeled as a combination of logistic (mean of zero and standard deviation of 0.05% of the maximum intensity) and uniform (ten ppm of randomly selected pixels were assigned uniformly distributed random values between zero and the maximum pixel value) distributions. Except for thermal noise, the parameters for all the aforementioned noise models were fixed based on examples in [19] and our own assumptions.

Thermal noise was modeled by a zero-mean Gaussian distribution with different levels of standard deviation (${\sigma _{thermal}}$). Since thermal noise is the most significant noise source, we evaluated the denoising effect using levels of ${\sigma _{thermal}}$ ranging from 0 to 20 (expressed in 8-bit gray levels). The standard deviation of the total noise (${\sigma _{total}}$) changes with ${\sigma _{thermal}}$ as shown in Table 1.
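A simplified version of this noisy-frame generator might look like the Python sketch below. It includes only shot noise, PRNU, and thermal noise (crosstalk and dark-current non-uniformity from the full model are omitted), and the full-well capacity is our own illustrative assumption, not a value from [19]:

```python
import numpy as np

def simulate_noisy_frame(i_ref, full_well=10000, prnu_sigma=0.0001,
                         sigma_thermal=8.0, rng=None):
    """Generate one simulated noisy frame from a clean 8-bit reference:
    Poisson shot noise at the full-well-scaled signal, a normally
    distributed PRNU gain map, and additive Gaussian thermal noise."""
    rng = np.random.default_rng(rng)
    electrons = np.asarray(i_ref, float) / 255.0 * full_well
    shot = rng.poisson(electrons).astype(float)        # photon shot noise
    gain = rng.normal(1.0, prnu_sigma, electrons.shape)  # PRNU gain map
    signal = shot * gain / full_well * 255.0           # back to 8-bit scale
    thermal = rng.normal(0.0, sigma_thermal, electrons.shape)
    return np.clip(signal + thermal, 0, 255)

ref = np.full((64, 64), 128.0)                 # flat mid-gray reference
frame = simulate_noisy_frame(ref, rng=0)
```

Note that a fixed PRNU map should strictly be reused across frames of a sequence; here a fresh map is drawn per call for brevity. With these parameters the thermal term dominates, so the frame's total noise is close to `sigma_thermal`, mirroring Table 1.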

Table 1. Values of ${\sigma _{total}}$ at different ${\sigma _{thermal}}$ levels

3.2.2 Establishing parameters for wavelet thresholding

We established parameters for wavelet thresholding based on noisy frames generated as in Section 3.2.1. On comparing different threshold estimation techniques, the B-M method gave a better PSNR than the U-T method (Fig. 5a). For the B-M method, we evaluated the effects of the sparsity parameter $\alpha $ ($\alpha > 1$), thermal noise standard deviation ${\sigma _{thermal}}$ (with mean noise $\mu $ = 0), and the wavelet family used (Symmlet-6 (sym6), Symmlet-8 (sym8), Daubechies-4 (db4), Daubechies-10 (db10), and Coiflet-5 (coif5)) [20–22] on the PSNR of denoised images. The output PSNR was observed to decrease steadily with increasing $\alpha $ values, for every wavelet considered, with ${\sigma _{total}}$ ranging from 7.7 to 20 (results not shown). Consequently, we used $\alpha = 1.05$ for further experiments in this study.

Fig. 5. Parameters for wavelet thresholding: (a) B-M versus U-T methods, with ${\sigma _{total}}$ = 8.1; (b) denoised PSNR with different wavelet families at different wavelet decomposition levels, with ${\sigma _{total}}$ = 8.1; (c) denoised PSNR at different wavelet decomposition levels with ${\sigma _{total}}$ ranging from 7.7 to 20 (Coiflet-5 wavelet).

Figure 5b displays the denoised image PSNR obtained using multiple wavelet families over a range of wavelet decomposition levels. Since the coif5 wavelet family gave slightly better PSNR values than the other wavelet families at level-2 wavelet decomposition, we chose coif5 as the denoising wavelet for further steps. Figure 5c shows the denoised image PSNR for multiple wavelet decomposition levels with ${\sigma _{total}}$ ranging from 7.7 to 20. The noise intensity present in the image influenced which wavelet decomposition level gave the maximum output PSNR. Level-2 wavelet decomposition gave optimal denoising performance when ${\sigma _{total}}$ was less than 20. Accordingly, a level selection step can be implemented in the denoising algorithm according to the estimated noise intensity in the image. Since the expected ${\sigma _{total}}$ of most captured images is less than 20, we implemented level-2 decomposition in the results presented.

For thresholding procedure selection, soft thresholding performed better (results not shown), likely because shrinking the wavelet coefficients preserves more of the high-frequency texture detail in the DL target than the all-or-nothing removal of hard thresholding.

In summary, we established the following wavelet denoising parameters for our study: (a) B-M threshold selection strategy with $\alpha $ value of 1.05, (b) level-2 decomposition, (c) Coiflet-5 wavelet, and (d) soft thresholding.
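To make the thresholding step concrete, the sketch below implements wavelet soft thresholding in the simplest possible setting: a single-level 2-D Haar transform written directly in NumPy. This is a hedged stand-in for the study's actual configuration (Coiflet-5 wavelet, level-2 decomposition, B-M threshold selection), which would normally be run through a wavelet library such as PyWavelets; the function names here are illustrative, not from the paper.

```python
import numpy as np

def soft_threshold(c, lam):
    """Soft thresholding (Eq. 11): shrink coefficients toward zero by lam."""
    return np.sign(c) * np.maximum(np.abs(c) - lam, 0.0)

def haar_denoise(img, lam):
    """One-level 2-D orthonormal Haar decomposition; soft-threshold the
    detail subbands (LH, HL, HH), keep the approximation (LL), invert."""
    a = img[0::2, 0::2]; b = img[0::2, 1::2]
    c = img[1::2, 0::2]; d = img[1::2, 1::2]
    LL = (a + b + c + d) / 2.0
    LH = (a + b - c - d) / 2.0
    HL = (a - b + c - d) / 2.0
    HH = (a - b - c + d) / 2.0
    LH, HL, HH = (soft_threshold(s, lam) for s in (LH, HL, HH))
    out = np.empty_like(img, dtype=float)
    out[0::2, 0::2] = (LL + LH + HL + HH) / 2.0
    out[0::2, 1::2] = (LL + LH - HL - HH) / 2.0
    out[1::2, 0::2] = (LL - LH + HL - HH) / 2.0
    out[1::2, 1::2] = (LL - LH - HL + HH) / 2.0
    return out
```

In practice the threshold $\lambda$ would be derived from the robust noise estimate of Eq. (9), $\hat{\sigma} = \mathrm{Median}(|HH|)/0.6745$, via the B-M criterion, rather than chosen by hand as in this toy sketch.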

3.2.3 Performance of the one- and two-step denoising procedures

Based on the established theoretical model, Fig. 6 plots the improvement in PSNR for the one-step and two-step denoising procedures. The parameters established in Section 3.2.2 were used unless otherwise specified. For each noise level considered, the improvement from the baseline PSNR1 (i.e., the PSNR of ${I_1}$) was calculated for each procedure. Since the simulated noisy frames were all generated by adding noise to the reference image in the same way, the baseline PSNR of any frame ${I_i}$ (i = 1, 2, …, N) should be the same. We evaluated the one- and two-step denoising procedures based on reference images from the Canon T3i, iPhone 5S and Nexus 5. As Fig. 6 shows, the PSNR improvement due to wavelet thresholding grew as the added noise intensity in the simulated frames increased.
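The PSNR values reported throughout follow Eqs. (13) and (14). A minimal helper (assuming 8-bit images, so $I_{max}$ = 255) might look like:

```python
import numpy as np

def psnr(img, ref, i_max=255.0):
    """PSNR in dB per Eqs. (13)-(14): 10*log10(Imax^2 / MSE)."""
    mse = np.mean((img.astype(float) - ref.astype(float)) ** 2)
    return 10.0 * np.log10(i_max ** 2 / mse)
```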


Fig. 6. Improvement in PSNR of the denoised image with image averaging (N = 10) and wavelet thresholding for different cameras, where PSNR1, PSNRavg and PSNRavg+wav are the PSNR values for I1, Iavg and Iavg+wav, respectively (Fig. 4) based on Eqs. (13), (14).


Figure 7 compares the noise removed from ${I_1}$ based on one-step (${I_1} - {I_{avg}}$) or two-step (${I_1} - {I_{avg + wav}}$) procedures as a function of the noise added to ${I_1}$ (${I_1} - {I_{ref}}$). The reference line indicates when the removed noise is the same as the added noise. From the figure, the two-step procedure lines almost overlap with the reference lines, demonstrating more accurate denoising.


Fig. 7. Removed noise versus added noise in terms of ${\sigma _{total}}$ based on 10 noisy images: (a) Canon T3i; (b) iPhone 5S; (c) Nexus 5.


3.3 Fixed pattern noise evaluation

While this paper focuses mainly on temporal noise, fixed pattern noise (FPN) is another noise component that may influence $PS{D_{noise}}$. Temporally uniform FPN was not considered in our study because, as shown below, it was significantly smaller than the temporal noise. We used dark-frame images to estimate FPN at various camera ISO levels: thirty dark frames were captured and averaged to eliminate temporal noise, and the residual noise in the averaged image represents the FPN. Figure 8 illustrates the calculated PSD of FPN ($PS{D_{FPN}}$). The $PS{D_{FPN}}$ was two orders of magnitude lower than the dominant temporal noise components (Fig. 9) and was therefore considered negligible in this study.
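The dark-frame procedure above can be sketched as follows; `estimate_fpn` and `psd_2d` are illustrative names, and the FFT-based PSD omits the windowing and segment averaging that a production implementation might add:

```python
import numpy as np

def estimate_fpn(dark_frames):
    """Average many dark frames to suppress temporal noise; the residual
    (mean frame minus its global offset) is the fixed-pattern-noise map."""
    avg = np.mean(np.asarray(dark_frames, dtype=float), axis=0)
    return avg - avg.mean()

def psd_2d(noise_map):
    """2-D power spectral density of a noise map via the FFT
    (normalized so that the PSD sums to the total noise energy)."""
    f = np.fft.fftshift(np.fft.fft2(noise_map))
    return (np.abs(f) ** 2) / noise_map.size
```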


Fig. 8. ${{\boldsymbol{PSD}}_{\boldsymbol{FPN}}}$ calculated from dark frame images captured on (a) Canon T3i; (b) iPhone 5S; (c) Nexus 5.


Fig. 9. PSDnoise calculated for the gray patch and the texture regions, captured (N = 10) using the (a) Canon T3i (50 lx); (b) Canon T3i (500 lx); (c) iPhone 5S (JPEG); (d) Nexus 5 (JPEG).


4. Results and discussion

Using the two-step denoising procedure and the parameters outlined in Section 3, we compared $PS{D_{noise - gray}}$ with $PS{D_{noise - tex}}$, and the $MT{F_{DL - den}}$ approach with other $MTF$ approaches. Unlike Section 3, which used simulated noisy images, this section uses images captured with a DSLR camera (Canon T3i) and two mobile phones (iPhone 5S and Nexus 5), except in Section 4.3 where simulated noisy images were again used. The effect of the number of noisy images on denoising performance is also discussed.

4.1 Comparison of $PS{D_{noise - gray}}$ and $PS{D_{noise - tex}}$

We examined the $PS{D_{noise}}$ obtained from the gray patch ($PS{D_{noise - gray}}$) and the texture region ($PS{D_{noise - tex}}$), for both processed JPEG and unprocessed RAW images. The hypothesis was that the noise levels in the gray and texture regions would be the same for RAW images but different for processed JPEG images, since the ISP might treat different scene regions differently. Because it was difficult to obtain RAW images from the tested camera phones, we used the Canon T3i DSLR camera for this purpose. Unless otherwise specified, all noisy images in this section were captured with cameras rather than simulated as in Section 3.

The $PS{D_{noise - tex}}$ and $PS{D_{noise - gray}}$ were obtained from the noise maps of the texture region and the gray patch region, respectively. The noise map of the texture region was calculated as ${n_i} = {I_i} - {I_{avg + wav}}$. While the noise map of the gray patch region can be calculated with the same equation, the original algorithm [7] calculated $PS{D_{noise - gray}}$ from the gray patch of a single image. Since the $PS{D_{noise - gray}}$ values from both approaches were approximately the same (results not shown), we calculated $PS{D_{noise - gray}}$ from a single image, in alignment with the original algorithm. For the Canon T3i DSLR camera, RAW files of .cr2 format were converted to minimally processed RAW TIFF files using the dcraw program (https://www.cybercom.net/~dcoffin/dcraw/). The $PS{D_{noise - gray}}$ and $PS{D_{noise - tex}}$ results from the RAW files were compared with results from JPEG files obtained with minimal compression and the same pixel counts.
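A common way to turn a 2-D noise map such as ${n_i} = {I_i} - {I_{avg + wav}}$ into a 1-D noise spectrum is radial averaging of the 2-D PSD. The sketch below assumes a square region and integer-binned radii, which is one of several reasonable binning choices; the function name is illustrative, not from the paper.

```python
import numpy as np

def radial_psd(noise_map):
    """Radially averaged 1-D PSD of a square noise map.

    Bins the shifted 2-D PSD by integer radius from the DC bin
    and averages within each radius bin."""
    n = noise_map.shape[0]
    psd2 = np.abs(np.fft.fftshift(np.fft.fft2(noise_map))) ** 2 / noise_map.size
    y, x = np.indices(psd2.shape)
    r = np.hypot(x - n // 2, y - n // 2).astype(int)
    counts = np.bincount(r.ravel())
    sums = np.bincount(r.ravel(), weights=psd2.ravel())
    return sums / np.maximum(counts, 1)   # avoid division by zero in empty bins
```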

Figure 9 summarizes the $PS{D_{noise - gray}}$ and $PS{D_{noise - tex}}$ data for RAW and JPEG images under illumination levels of 50 lx and 500 lx for the different cameras. There were noticeable differences between the RAW and JPEG results. For RAW files, the observed $PS{D_{noise - gray}}$ and $PS{D_{noise - tex}}$ coincided, except that the $PS{D_{noise - gray}}$ curves fluctuated more, indicating that $PS{D_{noise - gray}}$ was less robust to noise. This matches the theoretical expectation that RAW images do not undergo substantial image processing. For JPEG images, $PS{D_{noise - gray}}$ was lower than $PS{D_{noise - tex}}$, indicating that the texture region was processed differently than the gray regions. This difference shows that the noise was correlated with image content, and therefore approximating $PS{D_{noise - tex}}$ by $PS{D_{noise - gray}}$ was inaccurate: over a major portion of the frequency spectrum, $PS{D_{noise}}$ was underestimated when using the gray region rather than the texture region. The magnitude of the difference was larger at an illumination level of 50 lx than at 500 lx. Images obtained from the mobile phones showed trends similar to those observed for the DSLR camera.

4.2 Comparison of the $MT{F_{DL - den}}$ approach with other $MTF$ approaches

Figure 10 compares the proposed $MT{F_{DL - den}}$ with $MT{F_{DL - gray}}$, obtained using the noise estimated from the gray region [7]; $MT{F_{DL - Cao}}$, obtained without noise correction [6]; and $MT{F_{SE}}$, based on a slanted-edge target [5]. $MT{F_{DL - den}}$ and $MT{F_{DL - gray}}$ overlapped for RAW images (Fig. 10a, b), indicating minimal image processing and thus the same amount of noise in the gray patch and texture regions. The $MT{F_{DL - Cao}}$ was significantly higher than $MT{F_{DL - den}}$ and $MT{F_{DL - gray}}$ for RAW images at both illumination intensities, especially at high frequencies, because high-frequency noise and artifacts artificially inflate the uncorrected $MTF$. For JPEG images at both illumination intensities, $MT{F_{DL - den}}$ was lower than $MT{F_{DL - gray}}$, indicating that the gray patch and texture regions were processed differently and thus had different estimated noise levels. Since modern camera devices denoise planar regions more aggressively than texture regions, the noise level in the texture region is usually higher than in the gray patch region for JPEG images. Figure 10 also shows that the $MT{F_{SE}}$ curve was lower than the other curves for RAW images, likely because the quality of the slanted-edge target was not high enough and high-frequency information in the images was missing. Differences in $MT{F_{DL - den}}$ between the mobile devices were also observable. For most JPEG images, the $MT{F_{DL - Cao}}$ and $MT{F_{DL - gray}}$ curves were close or overlapped, indicating that most cameras removed noise in the gray patch region well and thus minimized $PS{D_{noise - gray}}$ in Eq. (2).


Fig. 10. MTF comparison for captured images with Canon T3i, iPhone 5S and Nexus 5 at illumination intensities of 50 lx and 500 lx.


The ISPs in these devices also performed edge detection and, in certain cases, artificially improved the $MT{F_{SE}}$ (Fig. 10 c, d, f, g, h). For JPEG images, the $MT{F_{SE}}$ was boosted by edge enhancement during image processing, especially at low frequencies. The edge-intensity profiles of the slanted-edge images (Fig. 11) give further insight into the level of sharpening performed by the ISPs of the tested cameras. The intensity values were normalized such that the ‘zero’ and ‘one’ values correspond to locations distant from the edge in the dark and light regions, respectively. The artificially improved edge contrast of processed JPEG images explains their higher $MT{F_{SE}}$ compared to minimally processed RAW images. The effect of such selective processing was far less noticeable for any variant of $MT{F_{DL}}$, as the texture region was relatively unaffected by the edge-enhancement and denoising algorithms employed in camera devices.


Fig. 11. Intensity profiles of slanted-edge images for RAW and JPEG formats (inset): (a) 50 lx, and (b) 500 lx.


Since our two-step approach denoised the texture region effectively, the artificial increase of $MT{F_{DL}}$ at high frequencies by noise was avoided and thus the $MT{F_{DL - den}}$ was more accurate than $MT{F_{DL - Cao}}$ and $MT{F_{DL - gray}}$.
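Given the PSDs, the final $MT{F_{DL - den}}$ computation itself is short. The sketch below assumes the square-root form implied by Eq. (4) ($PS{D_{image}} = |MTF{|^2} \cdot PS{D_{target}} + PS{D_{noise}}$) and clips negative ratios that can arise from estimation error; the function name and the `eps` guard are illustrative assumptions, not from the paper.

```python
import numpy as np

def mtf_dl_den(psd_image_den, psd_target, eps=1e-12):
    """Texture MTF from the denoised-image PSD and the target PSD (Eq. 16):
    MTF = sqrt(PSD_image_den / PSD_target), clipped to non-negative values."""
    ratio = np.clip(psd_image_den / np.maximum(psd_target, eps), 0.0, None)
    return np.sqrt(ratio)
```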

4.3 Effects of the number of noisy images on denoising performance

The aforementioned results for the one-step and two-step denoising procedures were based on averaging 10 noisy images. The total number of images used for averaging can also affect the results, so we studied the effect of the number of noisy images on denoising performance using simulated noisy images. Since the noise level in these simulated images is known, we can accurately evaluate the denoising effect of each procedure.

Figure 12 displays the improvement in PSNR of the denoised images as the number of simulated noisy frames utilized ($N$) varies. The simulated total noise in each set of frames was set to the specified intensity level, and the PSNR obtained at each step of the denoising procedure was recorded. As expected, utilizing more frames in the averaging step gave a higher PSNR for the denoised image. Comparing Fig. 12 (a)-(c) with (d)-(f) shows that the improvement in PSNR due to wavelet thresholding was larger for the higher noise intensity (${\sigma _{total}} = 12.6$) and for lower numbers of averaged images. For ${\sigma _{total}} = 8.1$, thresholding provided no benefit for the Canon T3i and iPhone 5S.
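The averaging trend in Fig. 12 follows from the statistics of i.i.d. noise: averaging $N$ frames reduces the noise variance by a factor of $N$, i.e., a PSNR gain of roughly $10{\log _{10}}N$ dB. A small simulation (with an arbitrary synthetic reference image standing in for a captured DL frame) illustrates this:

```python
import numpy as np

rng = np.random.default_rng(0)
ref = rng.uniform(50.0, 200.0, size=(128, 128))   # synthetic stand-in reference
sigma = 12.6                                       # simulated total noise std

def avg_psnr(n_frames):
    """PSNR of the N-frame average against the reference image."""
    frames = ref + rng.normal(0.0, sigma, size=(n_frames,) + ref.shape)
    avg = frames.mean(axis=0)
    mse = np.mean((avg - ref) ** 2)
    return 10.0 * np.log10(255.0 ** 2 / mse)
```

For N = 50 the expected gain over a single frame is about $10{\log _{10}}50 \approx 17$ dB, consistent with averaging alone approaching the two-step result at large N.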


Fig. 12. Variation in PSNR of the denoised image with the number of images (N) used for averaging.


Figure 13 shows the same analysis as Fig. 7, except that it is based on 50 simulated noisy images rather than 10. The plots in Fig. 13 show that the performance of the one-step procedure approaches that of the two-step procedure when a large number of frames is utilized.


Fig. 13. Removed noise versus added noise in terms of ${\sigma _{total}}$ based on 50 noisy images: (a) Canon T3i; (b) iPhone 5S; (c) Nexus 5.


The required number of continuously captured frames depends on the camera model, illumination intensity, and noise level present in the images. If a sufficiently high number of frames is captured for averaging, and the captured frames have the same quality and noise level, the wavelet thresholding step may be unnecessary; however, these preconditions should be verified before removing it. For some camera phones, a fast continuous shooting rate (commonly known as burst rate) can reduce image quality.

5. Conclusions

With the increasing use of digital image processing in CPMDs for medical imaging and tele-dermatology, effective test methods for evaluating texture reproduction are becoming important. In this paper, we have shown that the noise in a DL target image is correlated with the image content, with measurements from a gray patch tending to underestimate noise levels in more heterogeneous regions. Accordingly, we proposed and evaluated a modification of the $MT{F_{DL - gray}}$ algorithm that denoises the DL target image and calculates the PSD of the denoised image ($PS{D_{image - den}}$) directly.

The two-step $MT{F_{DL - den}}$ approach applies image averaging followed by wavelet thresholding to effectively denoise the texture region of a DL target image. We validated the approach mathematically and experimentally through simulation of CMOS sensor noise utilizing a mix of possible noise distributions, comparison of removed noise with added noise, and inspection of noise spectra calculated from uniform gray and heterogeneous texture regions. The results show that the approach improved the accuracy of texture reproduction evaluation and was robust to noise. An advantage of $MT{F_{DL - den}}$ over full-reference methods is that the $MT{F_{DL}}$ can be calculated without knowing the source target digital data. The one-step $MT{F_{DL - den}}$ approach, which includes only image averaging, can also be used if a sufficient number of images (e.g., more than 50) can be captured with the same quality and noise level.

The proliferation of CPMDs brings with it the need for identification and standardization of best practices for device testing. As a DL target simulates the texture information in a natural scene, our findings indicate that the proposed $MT{F_{DL - den}}$ approach may be valuable for evaluating texture reproduction of CPMD images.

Funding

U.S. Food and Drug Administration (FDA) (CPOSEL12).

Acknowledgements

The project was supported by U.S. Food and Drug Administration’s Critical Path funding (CPOSEL12). Prof. Yu Chen acknowledges support from the NSF/FDA Scholar-In-Residence Program (CBET-1641077).

Disclosures

The mention of commercial products, their sources, or their use in connection with material reported herein is not to be construed as either an actual or implied endorsement of such products by the U.S. Department of Health and Human Services. The authors declare no conflict of interest.

References

1. L. de Greef, M. Goel, M. J. Seo, E. C. Larson, J. W. Stout, J. A. Taylor, and S. N. Patel, “Bilicam: using mobile phones to monitor newborn jaundice,” in Proceedings of the 2014 ACM International Joint Conference on Pervasive and Ubiquitous Computing (ACM, 2014), pp. 331–342.

2. J. Weingast, C. Scheibböck, E. M. Wurm, E. Ranharter, S. Porkert, S. Dreiseitl, C. Posch, and M. Binder, “A prospective study of mobile phones for dermatology in a clinical setting,” J. Telemed. Telecare 19(4), 213–218 (2013).

3. J. A. Wolf, J. F. Moreau, O. Akilov, T. Patton, J. C. English, J. Ho, and L. K. Ferris, “Diagnostic inaccuracy of smartphone applications for melanoma detection,” JAMA Dermatol. 149(4), 422–426 (2013).

4. FDA, “Mobile Medical Applications: Guidance for Industry and Food and Drug Administration Staff” (U.S. Food and Drug Administration, https://www.fda.gov/media/80958/download, 2015).

5. ISO, “ISO 12233: Photography – Electronic still picture imaging – Resolution and spatial frequency responses” (International Organization for Standardization, 2017).

6. F. Cao, F. Guichard, and H. Hornung, “Dead leaves model for measuring texture quality on a digital camera,” Proc. SPIE 7537, 75370E (2010).

7. J. McElvain, S. P. Campbell, J. Miller, and E. W. Jin, “Texture-based measurement of spatial frequency response using the dead leaves target: extensions, and application to real camera systems,” Proc. SPIE 7537, 75370D (2010).

8. A. B. Lee, D. Mumford, and J. Huang, “Occlusion models for natural images: A statistical study of a scale-invariant dead leaves model,” Int. J. Comput. Vision 41(1/2), 35–59 (2001).

9. N. Azzabou, N. Paragios, and F. Guichard, “Uniform and textured regions separation in natural images towards MPM adaptive denoising,” in International Conference on Scale Space and Variational Methods in Computer Vision (Springer, 2007), pp. 418–429.

10. L. Kirk, P. Herzer, U. Artmann, and D. Kunz, “Description of texture loss using the dead leaves target: current issues and a new intrinsic approach,” Proc. SPIE 9023, 90230C (2014).

11. U. Artmann, “Image quality assessment using the dead leaves target: experience with the latest approach and further investigations,” Proc. SPIE 9404, 94040J (2015).

12. J. Nakamura, Image Sensors and Signal Processing for Digital Still Cameras (CRC Press, 2017).

13. J. R. Barry, E. A. Lee, and D. G. Messerschmitt, Digital Communication (Springer Science & Business Media, 2012).

14. S. G. Chang, B. Yu, and M. Vetterli, “Wavelet thresholding for multiple noisy image copies,” IEEE Trans. Image Process. 9(9), 1631–1635 (2000).

15. S. G. Mallat, “A theory for multiresolution signal decomposition: the wavelet representation,” IEEE Trans. Pattern Anal. Mach. Intell. 11(7), 674–693 (1989).

16. D. L. Donoho and I. M. Johnstone, “Ideal spatial adaptation by wavelet shrinkage,” Biometrika 81(3), 425–455 (1994).

17. L. Birgé and P. Massart, “Gaussian model selection,” J. Eur. Math. Soc. 3(3), 203–268 (2001).

18. D. L. Donoho and I. M. Johnstone, “Adapting to unknown smoothness via wavelet shrinkage,” J. Am. Stat. Assoc. 90(432), 1200–1224 (1995).

19. R. D. Gow, D. Renshaw, K. Findlater, L. Grant, S. J. McLeod, J. Hart, and R. L. Nicol, “A comprehensive tool for modeling CMOS image-sensor-noise performance,” IEEE Trans. Electron Devices 54(6), 1321–1329 (2007).

20. G. Beylkin, R. Coifman, and V. Rokhlin, “Fast wavelet transforms and numerical algorithms I,” Commun. Pure Appl. Math. 44(2), 141–183 (1991).

21. I. Daubechies, “Orthonormal bases of compactly supported wavelets,” Commun. Pure Appl. Math. 41(7), 909–996 (1988).

22. I. Daubechies, Ten Lectures on Wavelets (Society for Industrial and Applied Mathematics, 1992).




Tables (1)

Table 1. Values of ${\sigma _{total}}$ at different ${\sigma _{thermal}}$ levels

Equations (16)

(1) $MT{F_{DL - Cao}} = \sqrt{PS{D_{image}}/PS{D_{target}}}$

(2) $MT{F_{DL - gray}} = \sqrt{({PS{D_{image}} - PS{D_{noise - gray}}})/PS{D_{target}}}$

(3) $y(i,j) = (h \ast x)(i,j) + n(i,j)$

(4) $PS{D_{image}} = |MTF{|^2} \cdot PS{D_{target}} + PS{D_{noise}}$

(5) $y(i,j) = (h \ast x)(i,j) + {n_x}(i,j)$

(6) $PS{D_{image}} = |MTF{|^2} \cdot PS{D_{target}} + PS{D_{noise}} + 2|MTF| \cdot PS{D_{target - noise}}$

(7) ${I_{avg - s}} = \frac{1}{N}\sum_{i = 1}^N {I_i}$

(8) ${I_{avg - w}} = \sum_{i = 1}^N {\alpha _i}{I_i}$, with ${\alpha _i} = ({1/\sigma _i^2})\big/\sum_{i = 1}^N {1/\sigma _i^2}$

(9) $\hat{\sigma} = \mathrm{Median}(|H{H_1}|)/0.6745$

(10) $f(t) = \sum_{i = 1}^t c_i^2 + 2{\hat{\sigma}^2}\,t\left({\alpha + \log ({N_{pix}}/t)}\right)$

(11) ${\eta _{\lambda ,s}}({s_i}) = \begin{cases} {s_i} + \lambda , & \text{if } {s_i} < -\lambda \\ {s_i} - \lambda , & \text{if } {s_i} > \lambda \\ 0, & \text{if } |{s_i}| \le \lambda \end{cases}$

(12) ${\eta _{\lambda ,h}}({s_i}) = \begin{cases} {s_i}, & \text{if } |{s_i}| > \lambda \\ 0, & \text{if } |{s_i}| \le \lambda \end{cases}$

(13) $PSNR = 10{\log _{10}}({I_{max}^2/MSE})$

(14) $MSE = \frac{1}{ab}\sum_{i = 1}^a \sum_{j = 1}^b {({I(i,j) - {I_{ref}}(i,j)})^2}$

(15) $MT{F_{DL - tex}} = \sqrt{({PS{D_{image}} - PS{D_{noise - tex}}})/PS{D_{target}}}$

(16) $MT{F_{DL - den}} = \sqrt{PS{D_{image - den}}/PS{D_{target}}}$