Dark2Light: multi-stage progressive learning model for low-light image enhancement

Rui-Kang Li; Meng-Hao Li; Shi-Qi Chen; Yue-Ting Chen; Zhi-Hai Xu

doi:10.1364/OE.507966

1. Introduction

Photography under sub-optimal lighting conditions, such as back-lit and low-light, leads to a series of degradation in low-light images. To cope with this issue, researchers have proposed several methods to enhance low-light images. Traditional image post-processing methods, such as Histogram Equalization methods [1,2] and Retinex-based methods [3–5], primarily focus on illumination enhancement to expand dynamic range and increase image contrast. However, due to traditional methods’ reliance on hand-crafted priors and limitations in generalization abilities, severe noise, and unnatural artifacts will appear in the results after processing.

Recently, several deep-learning based methods [6–8] have been proposed to solve low-light image enhancement (LLIE) problems. Since deep neural networks can effectively learn the latent mapping between pairs of low-light and normal-light images, deep-learning methods become a major trend of LLIE research. However, due to the limited number of photons and inescapable noise, most of them tend to bias one side of the smoothness or details, such as satisfactory visual quality but over smooth, or rich in details but generating unnatural artifacts.

To this end, we look forward to designing a robust learning model that can simultaneously achieve illumination enhancement and noise removal. however, noise in low-light images can be expressed as a combination of shot noise, read noise, row noise, and quantization noise [9]. Without the rigorous calibration process adopted in [10,11], it is difficult for the network to effectively learn the noise distribution. Not only is the difference between noise and signal in low-light regions so small that it is difficult to characterize, but the noise will be amplified when enhancing illumination. Recently, multi-stage models have shown their superiority in the fields of exposure correction [6], image deblurring [12–14], image deraining [15], and image denoising [16]. Motivated by this learning mechanism, we determine to exploit the advantages of multi-stage models to capture intrinsic illumination adjustment and fine details recovery of low-light images [17].

In this paper, we propose a multi-stage progressive learning model for LLIE, namely Dark2Light. Different from existing deep LLIE methods learn an end-to-end mapping based on single-stage design, our method formulates the LLIE problem as two main problems within each stage: (1) illumination enhancement and (2) noise removal.

First, we convert the low-light images from sRGB space to linear space. Then we employ a contextual transformer block to estimate illumination enhancement. Since convolutional layers with static weights can lead to large discrepancies between trained and inferred models, we insert context blocks with different receptive fields into the transformer blocks, aiming to adaptively learn illumination mapping curves for each independent region in low-light images. Second, we adopt a simple U-Net shaped encoder-decoder architecture for image denoising. With the guidance of dual-supervised attention modules between each stage, the denoising block can effectively suppress noise. After quantitative and qualitative experiments, our method achieves the best results among state-of-the-art deep LLIE methods, especially in refined details and less noise. We also provide detailed ablation studies to verify the effectiveness of our method.

Overall, our contributions are summarized as follows:

• We formulate an image formation model from photons, RAW to sRGB images both in Low-Light and Normal-Light cases. After theoretical derivation and experimental proof, we find that training using inverse gamma correction pairs is more beneficial to LLIE.
• We propose a novel multi-stage model for LLIE that learns illumination enhancement and noise removal alternatively and progressively. To the best of our knowledge, Dark2Light is the first multi-stage model to handle the LLIE problem.
• We formulate a dual-supervised attention block that can effectively refine supervised features before propagating further between each two stages.

2. Related works

As a hot research topic in the past several years, a large number of LLIE methods have been proposed. Traditional methods for LLIE include the Histogram Equalization (HE)-based methods and the Retinex theory-based methods. HE is an algorithm that recalculates the gray-scale probability distribution through the cumulative distribution function of the original image to maximize the image entropy, and it was widely used after being recorded in 1987 [18]. To solve its shortage of detail loss and over-enhancement, many improvement methods have also been proposed [1,2,19–21]. The most famous one that has been widely used commercially is CLAHE [22], which divides the image into multiple overlapping blocks and then performs histogram calculations. Although the HE-based methods are simple and fast to implement, there are still problems such as detail loss, massive noise, and color distortion in the enhanced image.

A typical Retinex theory-based method [3] decomposes a low-light image into a reflection map and an illumination map through prior or regularization. The single-scale [4] and multi-scale [5] Retinex algorithms directly use the reflection map as the result, resulting in halos and color casts in the image. In order to better decompose images, Kimmel et al. [23] first introduce the variational model into Retinex. Later, the total variation suppression model was also introduced by Ng et al. [24] to obtain a smoother illumination map. However, Retinex-based methods use Gaussian convolution templates for illumination estimation and do not have the ability to preserve edges, so they may cause halos in certain areas or over-enhance the entire image.

Recent years have witnessed the massive success of learning-based methods for LLIE since the first work namely LLNet [25]. Chen et al. [26] combine the U-Net with the ISP pipeline to optimize an end-to-end network. Zhang et al. [27] design three sub-nets based on Retinex theory to learn the illumination map, namely KinD. As a representative of unsupervised-learning methods, EnlightenGAN [28] uses an attention-guided U-Net as the generator and a global-local discriminator to optimize enhanced images. Zero-DCE [6] estimates the mapping curve between low-light and normal-light image pairs by optimizing four exquisite losses. Yang et al. [29] introduce the long-short memory mechanism and propose a semi-supervised network to enhance low-light images. Xu et al. [30] combine an SNR-aware transformer model and a convolutional model to adaptively enhance low-light images in a spatial-varying manner.

However, most of the methods mentioned above have several biases, such as satisfactory visual quality but over smooth, or rich in details but generating unnatural artifacts. In this work, we design a multi-stage learning model aiming at well-handling both illumination enhancement and noise removal.

3. Proposed method

First, we introduce the detailed procedures of the physical image formation model under normal-light and low-light cases in Section 3.1. Then, the overall paradigm of Dark2Light is detailed in Section 3.2, where we illustrate the multi-stage progressive learning model from overall architecture to the design of each distinct block.

3.1 Preliminaries

In order to restore from low-light images to normal-light images, we need to fully understand the mechanism of image degradation in low-light conditions. To begin with, we consider the physical image formation model in the real world. Different from simply considering the decomposition of sRGB images based on Retinex theory, we’d like to look straight into the image processing pipeline from photons, raw image to sRGB image.

Up to now, the ground truth (GT) of low-light image enhancement remains unavailable and controversial, but the widely used one considered to be GT is captured at low ISO and long exposure settings, while the corresponding low-light pair is captured at high ISO and short exposure settings. Here, we consider the image formation model under these two settings, referred to as the Normal-Light case and Low-Light case respectively, as shown in the bottom of Fig. 1.

Fig. 1. Image Formation Model. (a) Poisson distribution over expected number of photons $p^{*}$, (b) Raw image pairs $R_{n}\text {-}R_{l}$, and (c) sRGB image pairs $I_{n}\text {-}I_{l}$.

Download Full Size | PDF

3.1.1 Raw image formation model

A camera sensor converts the photons hitting the pixel area during the exposure time into a digitized map of light intensity, known as a raw image $R$. Let’s first consider the ideal image formation model $R^{*}$ with no noise:

(1)$$R^{*} = g\alpha*p^{*},$$

where, $g$ is the analog gain, $\alpha$ is the quantum efficiency factor and $p^{*}$ is the expected number of photons hitting the camera sensor.

Due to the quantum nature of light, there exists an inevitable uncertainty in the number of collected photons, such uncertainty imposes a Poisson distribution over $p$, which follows $p \sim \mathcal {P}(p^{*}) = \mathcal {P}(\frac {R^{*}}{g\alpha })$, as shown in Fig. 1(a). Meanwhile, depending on the circuit design and processing pipeline, a variety of noises appear during the photon-to-electrons and electrons-to-voltage stages, such as dark current noise, thermal noise, row noise and banding pattern noise etc. Here, we combine the multiple noise into a compound noise namely as the read noise $N_{r} \sim \mathcal {N}(0, \sigma _{r}^{2})$, which follows the Gaussian distribution. Considering the real situation, the ideal image formation model in Eq. (1) can be modified as:

(2)$$R \sim g\alpha*\mathcal{P}(\frac{R^{*}}{g\alpha}) + \mathcal{N}(0, \sigma_{r}^{2}),$$

where, $R^{*}$ is the ideal raw image, $\sigma _{r}^{2}\textbf {}$ is the variance of Gaussian distribution.

Previous study [31] shows that in common cases a usual simplification is to treat the Poisson distribution $\mathcal {P}(\lambda )$ as a Gaussian distribution $\mathcal {N}(\lambda, \lambda )$. So, the practical image formation model $R$ in Eq. (2) can be formulated as:

(3)$$R = R^{*} + N_{p} + N_{r},$$

where, $N_{p}$ is the photon noise, follows $N_{p} \sim \mathcal {N}(0, g\alpha R^{*})$. $N_{r}$ is the read noise, follows $N_{r} \sim \mathcal {N}(0, \sigma _{r}^{2})$.

Furthermore, the raw image formation model under Normal-Light case and Low-Light case can be formulated as:

(4)$$\left\{ \begin{array}{l} R_{n} = KR^{*},\\ R_{l} = R^{*} + N_{p} + N_{r},\\ \end{array} \right.$$

where, $K$ is the overall gain between two separate cases under different ISO and exposure time, $R_{n}$ represents the model under Normal-Light case and $R_{l}$ represents the model under Low-Light case. The visual comparison of $R_{n}\text {-}R_{l}$ can be seen in Fig. 1(b). As for the Normal-Light case, $N_{p}$ and $N_{r}$ are extremely small due to their low ISO and long exposure setting, so we set them to be zero.

3.1.2 sRGB image formation model

From raw images to sRGB images, the necessary process is well-known as the imaging signal processing (ISP) pipeline, which includes a series of modules such as black level compensation, demosaicing, white balance, gamma correction, tone mapping and etc. Here, we only consider modules that lead to nonlinear enhancement of illumination, specifically gamma correction and tone mapping. Under common situations, the mapping curve for tone mapping is the same as the gamma curve, except for local tone mapping. To make the discussion clearer, we omit tone mapping and only consider gamma correction. Gamma correction is designed to be consistent with the perception of the human eye. After the ISP pipeline, the sRGB image formation model is as follows:

(5)$$\left\{ \begin{array}{l} I_{n} = (KR^{*})^{\gamma},\\ I_{l} = (R^{*} + N_{p} + N_{r})^{\gamma},\\ \end{array} \right.$$

where, $\gamma$ is the gamma correction curve factor, $I_{n}$ represents the model under Normal-Light case and $I_{l}$ represents the model under Low-Light case. The visual comparison of $I_{n}\text {-}I_{l}$ can be seen in Fig. 1(c). To make the equation more straightforward, we omit the ISP which actually exists.

From Eq. (5), we can infer the latent mapping between $I_{n}$ and $I_{l}$ as follows,

(6)$$I_{n}^{\frac{1}{\gamma}} = KI_{l}^{\frac{1}{\gamma}} - KN_{p} - KN_{r},$$

From Eq. (6), we can speculate that compared to the latent mapping for sRGB pairs $I_{n}\text {-}I_{l}$, the latent mapping for inverse gamma correction pairs $I_{n}^{1/\gamma }\text {-}I_{l}^{1/\gamma }$ seems more easier for the model to learn. We conduct comparative experiments on LLIE using the same model with and without the gamma correction priors. In addition, the gamma curve we use is based on the Rec.709 standard. As shown in Fig. 2(a), we calculate the mapping curves for 485 image pairs in LoL training set, both mapping curves for $I_{n}\text {-}I_{l}$ and $I_{n}^{1/\gamma }\text {-}I_{l}^{1/\gamma }$ reveal a concave tendency with visually negative second derivative. But after the inversion of gamma correction, the mapping curves for $I_{n}^{1/\gamma }\text {-}I_{l}^{1/\gamma }$ show a more uniform tendency and follow a relatively consistent distribution. Moreover, Fig. 2(b)(c) show that training with $I_{n}^{1/\gamma }\text {-}I_{l}^{1/\gamma }$ have better performance (lower NIQE, higher PSNR) and lower training loss than training with $I_{n}\text {-}I_{l}$. Therefore, instead of simply using original image pairs for training, we feed inverse gamma correction pairs $I_{n}^{1/\gamma }\text {-}I_{l}^{1/\gamma }$ to our multi-stage model, which could realize better performance.

Fig. 2. Comparative experiments on LLIE using the same model(w and w/o gamma correction). (a) mapping curves for $I_{n}\text {-}I_{l}$ and $I_{n}^{1/\gamma }\text {-}I_{l}^{1/\gamma }$ in LoL training set. (b) Loss reduction maps. (c) LLIE performance in NIQE and PSNR.

Download Full Size | PDF

3.2 Dark2Light

Based on the above analysis, we propose a multi-stage network, Dark2Light, for the purpose of learning illumination enhancement and noise removal of the low-light image through an alternatively and progressively learning strategy. Within each stage, Dark2Light contains three blocks: contextual transformer block (CTB) for illumination enhancement, U-Net shaped denoising block (DB), and dual-supervised attention block (DAB) for transferring and supervising features between different sub-blocks. According to Eq. (6), if we want the model to learn the latent mapping between pairs $I_{n}^{1/\gamma }\text {-}I_{l}^{1/\gamma }$, it is essential to learn $K$ and $(N_{p}+N_{r})$ in advance. Here, we design the CTB to learn $K$ for illumination enhancement and the DB to learn $(N_{p}+N_{r})$ for noise removal. We put CTB in front of DB because we don’t have ground truth to calculate the loss of pair $I_{l}^{1/\gamma }\text {-}(I_{l}^{1/\gamma }+N_{p}+N_{r})$. Also, in order to prevent excessive amplification of noise after illumination enhancement, we adopt a multi-stage framework to gradually learn the restoration. At the end of each stage, we use the DAB to calculate pixel losses and fuse deep features from CTB and DB, as illustrated in Fig. 3.

Fig. 3. Architecture of the Dark2Light.

Download Full Size | PDF

3.2.1 Overall pipeline

Given a low-light image $I_{l}\in \mathbb {R}^{HW\times 3}$, Dark2Light first convert it to $I_{l}^{1/\gamma }$ using inverse gamma correction. Then, Dark2Light extracts the feature embeddings $F_{0}\in \mathbb {R}^{HW\times C}$ using a 3$\times$3 convolution layer, where $HW$ denotes the spatial dimension and $C$ is the number of channels. Next, the feature $F_{0}$ pass through $N$ number of stages, yielding deep features $F_{N}\in \mathbb {R}^{HW\times C}$. Within each individual stage, $F_{n}$ passes CTB and DB in turn, yielding two deep features $\overline {F_{N}}\in \mathbb {R}^{H W\times C}$ and $\overline {\overline {F_{N}}}\in \mathbb {R}^{HW\times C}$, where we use these two features for supervision to calculate learning loss in DAB. Last, we use another 3$\times$3 convolution layer to get $I_{n}^{1/\gamma }$ and then adopt gamma correction to get $I_{n}\in \mathbb {R}^{HW\times 3}$. In addition, DB is a U-Net architecture contains several residual blocks, pixel-shuffle/unshuffle [32] operations and skip connections [33].

3.2.2 Contextual transformer block

As described in Fig. 2(a), there exists a latent mapping curve between pairs of low-light and normal-light images, showing a negative second derivative trend. Previous traditional methods such as histogram equalization, homomorphic filtering have no learnable parameters, so they can’t adjust the mapping curve adaptively according to different images. Guo et. al [6] proposed a deep curve estimation (DCE) method, which is able to approximate pixel-wise and higher-order curves by iteratively updating itself. The curve can be expressed as:

(7)$$I_{N+1}^{1/\gamma} = I_{N}^{1/\gamma} + \alpha_{N}\cdot I_{N}^{1/\gamma}(1 - I_{N}^{1/\gamma}),$$

where, $I_{N+1}$ and $I_{N}$ are the output and input images after $n$ iterations respectively, $\alpha _{N}$ is a pixel-wise nonlinear mapping map for $I_{N}$. According to Zero-DCE, this simple curve has a similar shape to the mapping curves in Fig. 2(a), and it is differentiable and can approximate any other higher-order curves after several iterations.

DCE uses a plain CNN of seven convolutional layers with symmetrical concatenation to calculate mapping map $\gamma _{N}$. To deal with CNNs’ shortcomings of limited receptive fields and static weights, a more powerful alternative is the vision transformer with attention mechanism. Inspired by the core idea of DCE and the attention mechanisms of vision transformers, we propose the contextual transformer (CT), as shown in Fig. 3. CT can provide a better alternative to DCE and learn the pixel-wise map $\gamma _{N}$ from a larger receptive field, which is defined as:

(8)$$\left\{ \begin{array}{l} F_{i}^{'} = F_{i} + W_{i}\cdot\textbf{V}\otimes Softmax(\textbf{K}^\mathrm{T}\otimes\textbf{Q}),\\ F_{i}^{\prime\prime} = F_{i}^{'} + F_{i}^{'}\odot \sigma(F_{i}^{'}), \\ \end{array} \right. i=1, 2, 3, 4,$$

where, $\textbf {Q}\in \mathbb {R}^{HW\times C}$, $\textbf {K}^\mathrm {T}\in \mathbb {R}^{C\times HW}$, $\textbf {V}\in \mathbb {R}^{HW\times C}$ are query, key and value projections from feature $F_{i}$ respectively. $W$ is the weight matrix achieved by context blocks, and $\sigma$ is the GLUE activation operation. $i$ represents the number of the contextual transformer.

Comparing Eq. (7) and Eq. (8), we can find that the curve used in DCE can be approximated as a special case of CT. Moreover, we discard the up/down sampling operations in [34] which might break the relationship of neighboring pixels, and stack 4 CTs in a coarse-to-fine manner to estimate an illumination mapping curve in each distinct local region. As shown in Fig. 3, the contextual transformer block (CTB) use 4 CTs with different receptive field to process the input features $F_{i=1,2,3,4}$ respectively and fuse the output features $F_{i=1,2,3,4}^{''}$ in turn to get the final feature $\overline {F_{N}}$. Next, $\overline {F_{N}}$ is transferred to the denoising block as the input feature.

3.2.3 Dual-supervised attention block

Instead of simply stacking multiple stages in a cascading way, we design a dual-supervised attention block (DAB) between every two stages, which can effectively refine the incoming features before further propagation. The schematic diagram of DAB is shown in Fig. 4. DAB takes previous features $\overline {F_{N}}\in \mathbb {R}^{HW\times C}$ and $\overline {\overline {F_{N}}}\in \mathbb {R}^{HW\times C}$ from CTB and DB respectively and generates supervised features $\overline {I_{N}}\in \mathbb {R}^{HW\times 3}$, $\overline {\overline {I_{N}}}\in \mathbb {R}^{HW\times 3}$ using a 3$\times$3 convolutional layer. Then, $\overline {I_{N}}$, $\overline {\overline {I_{N}}}$ are used for calculating supervised losses, which we will describe later in Sec.3.3. After concatenating $\overline {I_{N}}$ and $\overline {\overline {I_{N}}}$, we use 1$\times$1 convolution and 3$\times$3 depth-wise convolution to restore the channel numbers back to $C$. Finally, the supervised features and the original features are multiplied using the attention mechanism to produce the final attention-guided feature $F_{N+1}$. In addition, $F_{N+1}$ is transferred to the stage $(N+1)$ as the input feature.

Fig. 4. Dual-supervised Attention Block.

Download Full Size | PDF

3.3 Loss function

Using low-light and normal-light image pairs, we can effectively train our Dark2Light. From Sec.3.2 we can know that, given the low-light image $I_{l}\in \mathbb {R}^{HW\times 3}$, Dark2Light can generate multi-stage results $(\overline {F_N}, \overline {\overline {F_N}})\in \mathbb {R}^{HW\times C}$, where $N$ relates to the stage number. Then, Dark2Light generates supervised features $(\overline {I_N}, \overline {\overline {I_N}})\in \mathbb {R}^{HW\times 3}$ in DAB and calculates pixel-wise loss for gradient backward, as mentioned below:

(9)$$\mathcal{L} = \sum_{i=0}^{N-1}\left\{ \mathcal{L}_{1}(\overline{I_N},\frac{I_n^{'1/\gamma}}{N-i}) + \mathcal{L}_{1}(\overline{\overline{I_N}},\frac{I_n^{'1/\gamma}}{N-i}) + \mathcal{L}_{edge}(\overline{\overline{I_N}},\frac{I_n^{'1/\gamma}}{N-i}) \right\} ,$$

where, $i$ is the specific stage number and $N$ is the total stage number. $\overline {I_N}$, $\overline {\overline {I_N}}$ are the supervised features from CTB and DN respectively. $I_n^{'1/\gamma }$ is the ground-truth normal-light image after inverse gamma correction. We also design a stepped loss for each stage using the factor $(N-i)$ and add an additional edge loss to guide the denoising block. Additional, $\mathcal {L}_{1}$ represents the $\mathcal {L}_{1}$ loss and $\mathcal {L}_{edge}$ represents the edge loss [35], defined as:

(10)$$\left\{ \begin{array}{l} \mathcal{L}_{1}(I_N,I_n^{'1/\gamma}) = \lVert I_N-I_n^{'1/\gamma} \rVert_1,\\ \mathcal{L}_{edge}(I_N,I_n^{'1/\gamma}) = \sqrt{\lVert \triangle(I_N)-\triangle(I_n^{'1/\gamma}) \rVert^{2} + \epsilon^{2}},\\ \end{array} \right.$$

where, $\triangle$ denotes the Laplacian operator, $\epsilon$ is set to be $1 \times 10^{-6}$ during training.

4. Experiment and analysis

4.1 Experimental configurations

4.1.1 Datasets

We evaluated our model on two benchmarks captured under the real-world low-light settings, including LoL-v1 [7] and LoL-v2 [8]. LoL-v1 dataset contains 500 low/normal-light image pairs, and we split 485 pairs for training and 15 pairs for testing. The LoL-v2 dataset is larger and more diverse than LoL-v1, including 689 low/normal-light pairs for training and 100 pairs for testing. It is worth noting that LoL-v2 captures completely different scenarios when building the training set and testing set. But in LoL-v1, the scenarios captured in the test set are also the same as the training set, which makes the model trained on LoL-v1 have better performance than the model trained on LoL-v2.

4.1.2 Training details

We implement our model with PyTorch framework on 1 NVIDIA GTX 3090 GPU. We set the number of stages $N$ to 4 to obtain the best results, and the effect of the stage number will be discussed in Sec.4.3. The entire network is trained with AdamW optimizer ($\beta _{1}=0.9$, $\beta _{2}=0.999$, weight decay$=1 \times 10^{-4}$) with a batch size of 4 for 80K iterations. The initial learning rate is set to be $1 \times 10^{-4}$, and it is reduced to half of itself at every 20K iteration. Patches at the size of 128$\times$128 are cropped from the training set with random horizontal and vertical flips. We perform evaluations on the testing set at each epoch during training, selecting the model with the highest scores as the final result.

4.2 Comparison with current methods

4.2.1 Quantitative results

We compare our method with a wide range of state-of-the-art methods for LLIE, including Zero-DCE [6], DRBN [29], KinD [27], LLFormer [36], Restormer [34], MIRNet-v2 [37], SNR-Net [30]. The PSNR and SSIM results are reported in Table 1 For most of these methods, we obtain their numerical results from released pre-trained models. For the methods without pre-trained models, we use the publicly available codes and train each method for 80K iterations in the same manner as our training strategy. Obviously, our Dark2Light outperforms all the methods mentioned above on LoL-v1/2 datasets. Detailed, Dark2Light achieves PSNR/SSIM values of 25.04/0.850 on LoL-v1 datasets, achieves PSNR/SSIM values of 21.74/0.846 on LoL-v2 datasets, which illustrates the superiority of our multi-stage model. Also, we calculate the color difference (CDiff) and universal quality index (UQI), it turns out that our results also have the best color fidelity and great consistency in the abstract feature domain.

Table 1. Quantitative comparison with state-of-the-art methods on LoL-v1/v2. (Bold numbers represent the best, Underlined numbers represent the second-best)

View Table | View all tables in this article

4.2.2 Qualitative results

We show visual comparison in Fig. 5 on LoL-v1. Obviously, Zero-DCE can’t eliminate severe noise in low-light images by simply enhancing illumination. Other methods like LLformer and Restormer, can suppress noise well but at the cost of local details. Also, SNR-Net can’t handle their fusion strategy of CNN and transformer architectures in smooth regions, and some unnatural artifacts appear in their results. While MIRNet-v2 has the best illumination enhancement results with less noise, their results have lower color fidelity. Among them, Dark2Light achieves results with less noise and rich details, without sacrificing the color fidelity.

Fig. 5. Visual comparison with state-of-the-art methods on LoL-v1. (Patches with zoom-in view shown in white box)

Download Full Size | PDF

Furthermore, we show more visual comparison in Fig. 6 on LoL-v2. Compared to LoL-v1, LoL-v2 has more images in its training set and testing set, resulting in greater differences in the results of all methods. Comparing all the results, we notice two key improvements of our method over the others. First, our method is able to suppress more noise and achieve better contrast in both smooth areas and rough areas, without obviously sacrificing the detail of the images. As can be seen from the last two rows of Fig. 6, our method generates the best detail of the iron grid and the billboard text. Second, our method also reveals vivid and natural colors, making the enhanced results look more realistic. For the first row of Fig. 6, only our method can recover the correct shape and the natural color of the light bulb.

Fig. 6. Visual comparison with state-of-the-art methods on LoL-v2. (Patches with zoom-in view shown in white box)

Download Full Size | PDF

Lastly, we also conduct experiments on real photos captured by smartphones. Using models trained on LoL-v1, we can successfully produce visually pleasing normal-light images with Dark2Light, even if this model has never seen this specific device before. As shown in Fig. 7, Dark2Light can achieve more natural normal-light images than MIRNet-v2, with higher contrast and fine details without over-enhancement in dark regions.

Fig. 7. Visual comparison with MIRNet-v2 on real-captured images.

Download Full Size | PDF

4.3 Ablation study

Stage number. For our multi-stage model, the number of stages is a crucial factor to evaluate. To verify our motivation of multi-stage architecture being beneficial in handling LLIE, we conducted experiments on the number of stages in the model, as shown in Table 2.

Table 2. Ablation study on the number of stages.

View Table | View all tables in this article

We can observe that the multi-stage model achieves the highest performance when the number of stages is 4. As a reasonable speculation about this experiment, the model is prone to under-fitting when it has a small $N$. On the contrary, the model is prone to over-fitting when it has a large $N$. Based on this result, we set $N$ to be 4 when conducting comparison experiments with other methods in Sec.4.2.

Furthermore, a major advantage of our method over existing ones is that we build the multi-stage model based on the image formation model. Within each stage, Dark2Light first uses CTB for illumination enhancement and then DB for noise removal. To better understand how our model refines features, we visualize the supervised features within each stage in Fig. 8. We can clearly see that the results become brighter and have less noise as the stage goes further.

Fig. 8. Generated results from different stages.

Download Full Size | PDF

Individual components. We consider two ablation settings by removing different components from our model individually.

• Ours w/o DN+DAB. Removes the denoising block and the dual-attention block, so the model only has contextual transformer blocks.
• Ours w/o DAB. Removes the dual-attention block, keeping contextual transformer block and denoising block.

Since DAB requires two supervisory features, we discard the ablation setting on the model without only DN. We performed both two ablation settings on the LoL-v1 dataset. Table 3 summarizes the results.

Table 3. Ablation study on the individual components.

View Table | View all tables in this article

Compared with all ablation settings, our full model achieves the best PSNR/SSIM. Furthermore, we make a visual comparison of models with and without denoising blocks, as shown in Fig. 9. The model without DN+DAB produced results that have more noise and suffer from a loss of color fidelity.

Fig. 9. Visual comparison of models w and w/o DN+DAB.

Download Full Size | PDF

5. Conclusion

In this paper, we proposed what we believe to be a novel framework for LLIE through a multi-stage model namely Dark2Light. Dark2Light is designed based on a real image formation model, which enables it to learn two main problems concerning the LLIE: (1) illumination enhancement and (2) noise removal. Extensive experiments on the existing benchmark datasets and real-captured images show that our proposed framework can achieve better quantitative and qualitative results.

Funding

National Natural Science Foundation of China (62275229); Pre-research project of civil aerospace technology (D040107).

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

References

1. H.-J. Kim, J.-M. Lee, J.-A. Lee, et al., “Contrast enhancement using adaptively modified histogram equalization,” in Advances in Image and Video Technology: First Pacific Rim Symposium, PSIVT 2006, Hsinchu, Taiwan, December 10-13, 2006. Proceedings 1, (Springer, 2006), pp. 1150–1158.

2. S.-D. Chen and A. R. Ramli, “Contrast enhancement using recursive mean-separate histogram equalization for scalable brightness preservation,” IEEE Trans. Consumer Electron. 49(4), 1301–1309 (2003). [CrossRef]

3. E. H. Land and J. J. McCann, “Lightness and retinex theory,” J. Opt. Soc. Am. 61(1), 1–11 (1971). [CrossRef]

4. D. J. Jobson, Z.-u. Rahman, and G. A. Woodell, “Properties and performance of a center/surround retinex,” IEEE Trans. on Image Process. 6(3), 451–462 (1997). [CrossRef]

5. D. J. Jobson, Z.-u. Rahman, and G. A. Woodell, “A multiscale retinex for bridging the gap between color images and the human observation of scenes,” IEEE Trans. on Image Process. 6(7), 965–976 (1997). [CrossRef]

6. C. Guo, C. Li, J. Guo, et al., “Zero-reference deep curve estimation for low-light image enhancement,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, (2020), pp. 1780–1789.

7. W. Y. Chen Wei, W. Wang, and J. Liu, “Deep retinex decomposition for low-light enhancement,” in British Machine Vision Conference, (2018).

8. W. Yang, W. Wang, H. Huang, et al., “Sparse gradient regularized deep retinex network for robust low-light image enhancement,” IEEE Trans. on Image Process. 30, 2072–2086 (2021). [CrossRef]

9. H. Wach and E. R. Dowski Jr, “Noise modeling for design and simulation of computational imaging systems,” in Visual Information Processing XIII, vol. 5438 (SPIE, 2004), pp. 159–170.

10. K. Wei, Y. Fu, J. Yang, et al., “A physics-based noise formation model for extreme low-light raw denoising,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2020), pp. 2758–2767.

11. S. Chen, J. Zhou, M. Li, et al., “Mobile image restoration via prior quantization,” Pattern Recognition Letters (2023).

12. X. Tao, H. Gao, X. Shen, et al., “Scale-recurrent network for deep image deblurring,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, (2018), pp. 8174–8182.

13. H. Zhang, Y. Dai, H. Li, et al., “Deep stacked hierarchical multi-patch network for image deblurring,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2019), pp. 5978–5986.

14. S. Chen, H. Feng, and D. Pan, “Optical aberrations correction in postprocessing using imaging simulation,” ACM Trans. Graph. 40(5), 1–15 (2021). [CrossRef]

15. D. Ren, W. Zuo, Q. Hu, et al., “Progressive image deraining networks: A better and simpler baseline,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, (2019), pp. 3937–3946.

16. S. W. Zamir, A. Arora, S. Khan, et al., “Multi-stage progressive image restoration,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, (2021), pp. 14821–14831.

17. S. Chen, T. Lin, and H. Feng, “Computational optics for mobile terminals in mass production,” IEEE Trans. Pattern Anal. Mach. Intell. 45(4), 4245–4259 (2023). [CrossRef]

18. R. C. Gonzales and P. Wintz, Digital image processing (Addison-Wesley Longman Publishing Co., Inc., 1987).

19. Q. Wang and R. K. Ward, “Fast image/video contrast enhancement based on weighted thresholded histogram equalization,” IEEE Trans. Consumer Electron. 53(2), 757–764 (2007). [CrossRef]

20. K. S. Sim, C. P. Tso, and Y. Y. Tan, “Recursive sub-image histogram equalization applied to gray scale images,” Pattern Recognit. Lett. 28(10), 1209–1221 (2007). [CrossRef]

21. S. Chen, H. Feng, K. Gao, et al., “Extreme-quality computational imaging via degradation framework,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), (2021), pp. 2632–2641.

22. S. M. Pizer, E. P. Amburn, and J. D. Austin, “Adaptive histogram equalization and its variations,” Computer vision, graphics, and image processing 39(3), 355–368 (1987). [CrossRef]

23. R. Kimmel, M. Elad, and D. Shaked, “A variational framework for retinex,” Int. J. computer vision 52(1), 7–23 (2003). [CrossRef]

24. M. K. Ng and W. Wang, “A total variation model for retinex,” SIAM J. Imaging Sci. 4(1), 345–365 (2011). [CrossRef]

25. K. G. Lore, A. Akintayo, and S. Sarkar, “Llnet: A deep autoencoder approach to natural low-light image enhancement,” Pattern Recognit. 61, 650–662 (2017). [CrossRef]

26. C. Chen, Q. Chen, J. Xu, et al., “Learning to see in the dark,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, (2018), pp. 3291–3300.

27. Y. Zhang, J. Zhang, and X. Guo, “Kindling the darkness: A practical low-light image enhancer,” in Proceedings of the 27th ACM international conference on multimedia, (2019), pp. 1632–1640.

28. Y. Jiang, X. Gong, D. Liu, et al., “Enlightengan: Deep light enhancement without paired supervision,” IEEE Trans. on Image Process. 30, 2340–2349 (2021). [CrossRef]

29. W. Yang, S. Wang, Y. Fang, et al., “From fidelity to perceptual quality: A semi-supervised approach for low-light image enhancement,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, (2020), pp. 3063–3072.

30. X. Xu, R. Wang, C.-W. Fu, et al., “Snr-aware low-light image enhancement,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, (2022), pp. 17714–17724.

31. A. Foi, M. Trimeche, V. Katkovnik, et al., “Practical poissonian-gaussian noise modeling and fitting for single-image raw-data,” IEEE Trans. on Image Process. 17(10), 1737–1754 (2008). [CrossRef]

32. W. Shi, J. Caballero, F. Huszár, et al., “Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network,” in Proceedings of the IEEE conference on computer vision and pattern recognition, (2016), pp. 1874–1883.

33. O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, (Springer, 2015), pp. 234–241.

34. S. W. Zamir, A. Arora, S. Khan, et al., “Restormer: Efficient transformer for high-resolution image restoration,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, (2022), pp. 5728–5739.

35. K. Jiang, Z. Wang, P. Yi, et al., “Multi-scale progressive fusion network for single image deraining,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, (2020), pp. 8346–8355.

36. T. Wang, K. Zhang, T. Shen, et al., “Ultra-high-definition low-light image enhancement: A benchmark and transformer-based method,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37 (2023), pp. 2654–2662.

37. S. W. Zamir, A. Arora, and S. Khan, “Learning enriched features for fast image restoration and enhancement,” IEEE Trans. Pattern Anal. Mach. Intell. 45(2), 1934–1948 (2022). [CrossRef]

Methods	LoL-v1				LoL-v2
Methods	PSNR $↑$	SSIM $↑$	CDiff $↓$	UQI $↑$	PSNR $↑$	SSIM $↑$	CDiff $↓$	UQI $↑$
Zero-DCE [6]	14.86	0.559	21.00	0.306	18.06	0.574	15.24	0.385
DRBN [29]	20.13	0.830	10.32	0.414	20.29	0.831	12.48	0.433
KinD [27]	20.86	0.790	9.588	0.462	14.74	0.641	11.67	0.464
LLFormer [36]	23.65	0.819	8.656	0.489	20.49	0.814	11.35	0.559
Restormer [34]	23.09	0.824	8.744	0.498	19.94	0.827	10.82	0.547
SNR-Net [30]	24.61	0.842	7.62	0.530	21.41	0.849	9.892	0.575
MIRNet-v2 [37]	24.74	0.849	7.806	0.524	21.03	0.812	11.47	0.544
Ours	25.04	0.850	7.527	0.529	21.74	0.846	9.879	0.550

Stage Number $N$	2	3	4	5
PSNR	22.47	23.89	25.04	24.89
SSIM	0.829	0.842	0.850	0.848

	ours w/o DN+DAB	ours w/o DAB	ours
PSNR	22.60	23.19	25.04
SSIM	0.824	0.838	0.850

Methods	LoL-v1				LoL-v2
Methods	PSNR $↑$	SSIM $↑$	CDiff $↓$	UQI $↑$	PSNR $↑$	SSIM $↑$	CDiff $↓$	UQI $↑$
Zero-DCE [6]	14.86	0.559	21.00	0.306	18.06	0.574	15.24	0.385
DRBN [29]	20.13	0.830	10.32	0.414	20.29	0.831	12.48	0.433
KinD [27]	20.86	0.790	9.588	0.462	14.74	0.641	11.67	0.464
LLFormer [36]	23.65	0.819	8.656	0.489	20.49	0.814	11.35	0.559
Restormer [34]	23.09	0.824	8.744	0.498	19.94	0.827	10.82	0.547
SNR-Net [30]	24.61	0.842	7.62	0.530	21.41	0.849	9.892	0.575
MIRNet-v2 [37]	24.74	0.849	7.806	0.524	21.03	0.812	11.47	0.544
Ours	25.04	0.850	7.527	0.529	21.74	0.846	9.879	0.550

Stage Number $N$	2	3	4	5
PSNR	22.47	23.89	25.04	24.89
SSIM	0.829	0.842	0.850	0.848

Dark2Light: multi-stage progressive learning model for low-light image enhancement

Abstract

1. Introduction

2. Related works

3. Proposed method

3.1 Preliminaries

3.1.1 Raw image formation model

3.1.2 sRGB image formation model

3.2 Dark2Light

3.2.1 Overall pipeline

3.2.2 Contextual transformer block

3.2.3 Dual-supervised attention block

3.3 Loss function

4. Experiment and analysis

4.1 Experimental configurations

4.1.1 Datasets

4.1.2 Training details

4.2 Comparison with current methods

4.2.1 Quantitative results

4.2.2 Qualitative results

4.3 Ablation study

5. Conclusion

Funding

Disclosures

Data availability

References

Data availability

Cited By

Figures (9)

Tables (3)

Equations (10)

Optics Express