Article Information
Corresponding Author: Chen Yong*, chenyong@cqupt.edu.cn
Chen Yong*, School of Automation, Chongqing University of Posts and Telecommunications, Chongqing, China, chenyong@cqupt.edu.cn
Meiyong Huang, School of Automation, Chongqing University of Posts and Telecommunications, Chongqing, China, s200301006@stu.cqupt.edu.cn
Huanlin Liu, School of Communication and Information Engineering, Chongqing University of Posts and Telecommunications, Chongqing, China, liuhl@cqupt.edu.cn
Jinliang Zhang, School of Automation, Chongqing University of Posts and Telecommunications, Chongqing, China, s210302013@stu.cqupt.edu.cn
Kaixin Shao, School of Automation, Chongqing University of Posts and Telecommunications, Chongqing, China, s210301043@stu.cqupt.edu.cn
Received: January 27 2022
Revision received: April 1 2022
Revision received: July 7 2022
Accepted: July 21 2022
Published (Print): August 31 2022
Published (Electronic): August 31 2022
1. Introduction
Unevenly illuminated low-light images suffer from low visibility in some local regions. To alleviate this problem, researchers have developed numerous approaches to low-light image enhancement. They can be roughly divided into physical means, histogram equalization (HE)-based, Retinex-based, deep learning-based, and adversarial learning-based methods. One characteristic of underexposed images is a low signal-to-noise ratio (SNR), which means that noise is intense and dominates the image signal [1]. Some physical means try to acquire sufficient light for the camera, such as extending the exposure time or increasing the ISO (International Organization for Standardization) sensitivity. However, the former may introduce blur when the camera shakes or the object moves, and the latter may introduce intense noise, degrading the quality of the images.
HE-based methods can run in real time because they simply stretch the dynamic range of an image by evenly redistributing pixel intensities. Brightness preserving dynamic histogram equalization (BPDHE) [2] is a global histogram equalization method that performs well in preserving the lightness order of the input image. However, it fails to recover details in dark regions because of gray-level merging. HE-based methods generally make global adjustments that give insufficient consideration to local dark areas, resulting in over-/under-exposure. Besides, this kind of method does not handle noise effectively.
Retinex-based methods build on the Retinex theory, which assumes that an image is a combination of illumination and reflectance. The key idea is to estimate and remove the influence of illumination, such as the average intensity of light. KinD++ [3] follows a divide-and-conquer principle. Not only does it brighten dark regions, but it also removes hidden degradation artifacts such as noise and color distortion. However, because there are no clear definitions of ground-truth illumination and reflectance, decomposing an image is difficult for Retinex-based methods.
Recently, the development of deep learning-based methods has significantly boosted the performance of image restoration tasks by learning the underlying signal features of input images. Lore et al. [4] proposed a stacked deep auto-encoder named low-light net (LLNet) to learn joint denoising and lightness enhancement, the first deep learning method used in the low-light enhancement field. The method in [5] generates attention maps that focus on underexposed regions to avoid overexposing local areas. Deep stacked Laplacian restorer (DSLR) [6] proposed a decomposition-based scheme that separately recovers the global illumination and local details from the original input. It leverages valuable properties of the Laplacian pyramid based on strong connections of higher-order residuals in a multi-scale structure at both the image and feature levels. It is worth noting that most deep learning-based methods must be supported by large-scale paired low-/normal-light datasets. However, as it is impractical to capture low-light and normal-light images of the same scene concurrently, it is challenging to collect large-scale paired low-/normal-light datasets with diversified content.
Generative adversarial networks (GANs) learn the mapping between two domains through adversarial learning and perform well on domain transfer tasks. For example, [7] achieves good style transfer without paired images. Unpaired datasets contain images from two domains that need not depict the same scene but must present the essential characteristics of their domain, such as dark or bright. As images with low and normal illumination belong to the low and normal illumination domains, respectively, researchers tend to adopt the idea of domain transfer and apply GANs to low-light image enhancement using unpaired datasets. This kind of work overcomes the lack of large-scale paired datasets, which is a remarkable advantage of GANs.
Nevertheless, the task of low-light image enhancement remains challenging. (1) Previous literature gives insufficient consideration to local dark regions, which introduces over-/under-exposure artifacts during enhancement. (2) The enhancement procedure smooths out local structural details and distorts color, causing the images to be inconsistent with human perceptual preference [8], which is not determined by a single aspect. However, previous literature focuses on only a single problem, such as illumination improvement, detail recovery, or noise removal. We argue that for images with fluctuating illumination distributions, the enhancement task needs to consider several aspects simultaneously, such as enhancing the brightness of global and local areas, restoring local structural details, controlling color deviations, and removing undesirable noise.
To satisfy the above goals and generate high-quality images, we propose a GAN-based local lightness-aware enhancement network, inspired by [9], that enhances locally underexposed images under the guidance of unpaired low-/normal-light images. Specifically, our architecture includes three components: a super-resolution module (SRM) for detail preservation, a low-light image enhancement network (LLIEN) for lightening up global and local dark regions, and a denoising-scaling module (DSM) for noise suppression. In the first stage, the SRM improves the resolution of the input low-light images and generates fine-grained high-resolution versions. This design allows the subsequent enhancement to be performed in high-resolution space, so that texture information is retained because the details of local dark regions have been amplified by super-resolution. In the second stage, to enhance the local illumination and obtain images with uniformly distributed illumination, we integrate the local lightness attention module into LLIEN, which guides the generator to emphasize local low-light areas adaptively. Then, we introduce multiple discriminators to evaluate the enhanced images from different perspectives, driving the network to generate images that match human visual preferences. Specifically, the global discriminator evaluates the global lightness enhancement. The local region discriminator distinguishes whether local areas are lightened up to realistic normal-light ones, which helps to improve the lightness of local dark areas. The color discriminator evaluates the naturalness of the restored color, which is essential in controlling color bias. In the last stage, the DSM removes the noise amplified in the SRM and LLIEN and down-samples the result to the scale of the input low-light image.
Compared with other algorithms, our approach considers multiple aspects, such as illumination improvement, detail recovery, and noise removal, to conform to human visual preference instead of addressing a single task. Therefore, the results of our method are more realistic and achieve higher aesthetic quality. Both qualitative and quantitative experiments indicate that our approach achieves considerable enhancements. To sum up, the main contributions of this work are as follows:
We propose a super-resolution strategy specifically designed to perform enhancement in high-resolution space, enabling the network to retain details of contents and texture for local areas suffering from low visibility.
We design a local lightness attention module to distinguish areas of underexposed regions from well-illuminated regions, enabling the network to pay more attention to local dark regions and prevent the whole image from over-/under-exposure artifacts.
We introduce multiple discriminators, which assess the enhanced images from the perspectives of global illumination distribution, local area exposure, and color distortion, driving the network to generate images that conform to human perceptual preferences.
The organization of the rest of the paper is as follows. Section 2 briefly presents the SRM and then introduces the LLIEN, where the local lightness attention module is introduced first. After that, multiple discriminators and DSM are presented successively. Section 3 provides performance analysis, including an ablation study and a comparison with other algorithms. Finally, concluding remarks are provided in Section 4.
2. Proposed Approach
2.1 Architecture Overview
The primary purpose of our method is to light up local dark regions and the whole low-light image while recovering texture details, avoiding over-/under-exposure in local regions, and controlling color deviation. As illustrated in Fig. 1, our model consists of three main components: SRM, LLIEN (which includes a local lightness attention module and multiple discriminators), and DSM. The SRM improves the resolution of the input low-light images and then feeds the fine-grained low-light images into LLIEN. This strategy helps to avoid detail loss during lightness enhancement in LLIEN. In LLIEN, the local lightness attention module generates an attention map that distinguishes dark and bright areas. Under the guidance of the attention map, the generator of LLIEN pays more attention to lightening up local dark regions than bright ones, avoiding over-/under-exposure artifacts. Afterward, benefiting from carefully designed loss functions, multiple discriminators evaluate the generated images from different perspectives, helping LLIEN to enhance global and local brightness and control color deviation. Finally, the DSM removes the noise and then down-samples the clean image to the size of the original low-light image.
Illustration of the proposed model. SRM generates a high-resolution version of low-light images, LLIEN lightens up local dark regions and the whole images, and DSM suppresses noise and generates clear, enhanced images.
2.2 Super-Resolution Module
Local structural details are usually smoothed out [10] during low-light image enhancement. To address this challenge, we design an SRM. Although bilinear interpolation is a practical approach to super-resolution, it introduces blur in underexposed areas. To deal with this issue, we employ a classical method, EDSR [11]. The SRM consists of 32 identical residual blocks, which are crucial for detail recovery, together with several convolutional and up-sampling layers. Each residual block contains two convolutional layers with a ReLU activation function in between and no batch normalization layers. First, the original low-light images are fed into the SRM. Then, the features extracted by the residual blocks are fused with the features of the input and subsequently up-sampled to twice the original size. Constant scaling layers with a scaling factor of 0.1 are placed last for stable training.
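The residual block described above can be sketched in a few lines. This is a minimal illustration only, not the SRM itself: it uses 1-D signals and plain Python lists instead of 2-D feature maps, the function names are hypothetical, and it assumes that the constant scaling of 0.1 is applied to the residual branch before the skip addition, as in EDSR.

```python
def conv1d_same(x, kernel):
    """1-D cross-correlation with zero padding; kernel length must be odd."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [sum(k * padded[i + j] for j, k in enumerate(kernel))
            for i in range(len(x))]


def relu(x):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in x]


def residual_block(x, k1, k2, res_scale=0.1):
    """conv -> ReLU -> conv, no batch normalization, scaled skip connection."""
    body = conv1d_same(relu(conv1d_same(x, k1)), k2)
    # Assumption: the residual branch is scaled by res_scale, then added to x.
    return [xi + res_scale * bi for xi, bi in zip(x, body)]
```

With identity kernels `[0.0, 1.0, 0.0]` and a non-negative input, the block returns the input multiplied by 1.1, which makes the effect of the scaling factor easy to see.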
2.3 Low-light Image Enhancement Network
2.3.1 Local lightness attention module
To prevent the local regions from over-/under-exposure artifacts, we introduce a local lightness attention module that generates an attention map to guide LLIEN to pay more attention to local dark regions. The local lightness attention module consists of channel attention and spatial attention. Channel attention decides which feature maps are more meaningful than others. Spatial attention concentrates on the informative part of a specific feature map at the pixel level. The structure diagram of the local lightness attention module is illustrated in Fig. 2. Specifically, inspired by [12,13], the module first performs a channel-wise global average pooling operation to aggregate the spatial information and obtain a squeezed vector. Then it generates a weight vector through two fully-connected (fc) layers, a ReLU (Rectified Linear Unit) function, and a sigmoid function. Next, we obtain the channel-wise attention map by multiplying the weight vector with the input feature. In short, channel attention can be expressed as follows:
Structure diagram of the local lightness attention module, which is a combination of channel attention and spatial attention.
where [TeX:] $$I$$ denotes input feature maps of low-light images, [TeX:] $$W_{1}$$ and [TeX:] $$W_{2}$$ denote two fc layers, [TeX:] $$\delta$$ refers to the ReLU function, and [TeX:] $$\sigma$$ refers to the sigmoid function.
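A form of the channel attention consistent with these definitions is given below. It is a reconstruction in the squeeze-and-excitation style, not necessarily the authors' exact expression. The names CA and GAP (channel-wise global average pooling) and the channel-wise product symbol are notational assumptions.

[TeX:] $$\mathrm{CA}(I)=\sigma\left(W_{2}\, \delta\left(W_{1}\, \mathrm{GAP}(I)\right)\right) \otimes I$$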
To highlight the informative regions, a global average pooling and a global max pooling operation are applied to the feature maps along the channel axis, respectively. Each squeezes the number of channels to one and transforms the initial feature maps from [TeX:] $$W \times H \times C \text { to } W \times H \times 1$$, where [TeX:] $$W$$ denotes width and [TeX:] $$H$$ denotes height. These two feature maps are concatenated to generate an efficient feature descriptor. Then convolution layers and a sigmoid function are applied to the concatenated feature descriptor to acquire the spatial attention map. In short, spatial attention can be expressed as follows:
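The two attention steps can be sketched with plain Python lists. This is a minimal illustration under stated assumptions, not the module's implementation: feature maps are nested lists indexed as channel, row, column; the fc layers are bias-free weight matrices; the convolution over the concatenated descriptor is simplified to a 1x1 convolution with two hypothetical weights `w_avg` and `w_max`; and all function names are illustrative.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def channel_attention(feat, w1, w2):
    """Rescale each channel by a weight from GAP -> fc -> ReLU -> fc -> sigmoid."""
    # Squeeze: one average value per channel.
    squeezed = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feat]
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    weights = [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in w2]
    return [[[wc * v for v in row] for row in ch] for wc, ch in zip(weights, feat)]


def spatial_attention_map(feat, w_avg, w_max, bias=0.0):
    """Per-pixel map from channel-wise average and max pooling, then sigmoid."""
    channels, height, width = len(feat), len(feat[0]), len(feat[0][0])
    attn = []
    for y in range(height):
        row = []
        for x in range(width):
            vals = [feat[c][y][x] for c in range(channels)]
            avg, mx = sum(vals) / channels, max(vals)
            # Assumption: a 1x1 convolution over the two pooled maps.
            row.append(sigmoid(w_avg * avg + w_max * mx + bias))
        attn.append(row)
    return attn
```

A larger convolution kernel over the two pooled maps would add spatial context; the 1x1 case is used here only to keep the sketch short.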
2.3.2 Multiple discriminators
Global discriminator
The global discriminator distinguishes the enhanced images from real normal-light images by judging whether they follow the distribution of real normal-light images. It assists the network in improving the holistic illumination of low-light images at the image level, generating globally enlightened images.
Local region discriminator
An image-level global discriminator is not enough to enhance the local dark areas. Inspired by [9], we add a local region discriminator that accounts for local dark areas so that lightness is enhanced fully across the image. Specifically, for every generated image, we evenly crop the generated and real normal-light images into sub-images. The number of sub-images is preset to 4 to limit the computational cost. The local region discriminator evaluates whether each sub-image looks like a realistic, normally illuminated image, ensuring that over-/under-exposure artifacts are avoided for all local bright/dark regions.
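The even cropping step can be sketched as follows. This is an illustration under the assumption that the four sub-images form a non-overlapping 2x2 grid. The text fixes only the number of sub-images, and the function name is hypothetical.

```python
def crop_into_grid(image, rows=2, cols=2):
    """Split an image (list of pixel rows) into rows * cols equal patches."""
    height, width = len(image), len(image[0])
    patch_h, patch_w = height // rows, width // cols
    patches = []
    for r in range(rows):
        for c in range(cols):
            patch = [line[c * patch_w:(c + 1) * patch_w]
                     for line in image[r * patch_h:(r + 1) * patch_h]]
            patches.append(patch)
    return patches


# Each patch from a generated image and from a real normal-light image
# would then be scored separately by the local region discriminator.
```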
Color discriminator
We use an image assessment network pre-trained on the Aesthetic Visual Analysis (AVA) dataset to evaluate the aesthetic quality of the enhanced results. As it is difficult to assess an image with no reference, we adopt a relativistic classifier that evaluates paired inputs composed of synthetic and ground-truth images. The classifier outputs a binary number showing whether the enhanced image has “higher” (1) or “lower” (0) aesthetic quality than the ground-truth image. This strategy drives LLIEN to generate images with more realistic colors than ground truth.
In conclusion, the multiple discriminators evaluate the enhanced result from three aspects, aiming to restore global brightness and fine details while avoiding over-/under-exposure and the color cast.
2.3.3 Loss functions
Adversarial loss: We adopt the original LSGAN (least squares generative adversarial networks) [14] loss as our adversarial loss to learn the mapping between underexposed and target normal light images.
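The least squares objectives can be sketched as follows, assuming the common LSGAN label choice of 1 for real and 0 for generated samples. The function names and the plain lists of discriminator scores are illustrative, not part of the original framework.

```python
def lsgan_discriminator_loss(real_scores, fake_scores):
    """Push scores of real images toward 1 and scores of generated images toward 0."""
    real_term = sum((s - 1.0) ** 2 for s in real_scores) / len(real_scores)
    fake_term = sum(s ** 2 for s in fake_scores) / len(fake_scores)
    return 0.5 * real_term + 0.5 * fake_term


def lsgan_generator_loss(fake_scores):
    """Push the discriminator's scores of generated images toward 1."""
    return 0.5 * sum((s - 1.0) ** 2 for s in fake_scores) / len(fake_scores)
```

Unlike the cross-entropy GAN loss, the squared error penalizes a sample more the further its score lies from the target label.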
Color loss: We introduce color loss to encourage the generated images to satisfy the color distribution of the normal-light images,
where [TeX:] $$\hat{y}$$ indicates the ground-truth binary number, [TeX:] $$I_{enh}$$ and [TeX:] $$I_{G}$$ represent the enhanced result and the ground-truth image, respectively, and [TeX:] $$\Omega$$ is the aesthetic network.
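One form consistent with these symbols is given below. It assumes that the color loss is a standard binary cross-entropy on the output of the aesthetic network, which is an assumption and not a statement of the authors' exact expression. The name [TeX:] $$L_{color}$$ is likewise illustrative.

[TeX:] $$L_{color}=-\left[\hat{y} \log \Omega\left(I_{enh}, I_{G}\right)+(1-\hat{y}) \log \left(1-\Omega\left(I_{enh}, I_{G}\right)\right)\right]$$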
Perceptual preserving loss: We use perceptual preserving loss [9] from the pre-trained VGG to model our feature space distance for preserving image content features.
[TeX:] $$I^{L}$$ stands for the input low-light image, and [TeX:] $$G(I^{L})$$ denotes the enhanced result of the generator. [TeX:] $$\Phi_{i, j}$$ represents the feature map extracted from the pre-trained VGG-16 network. [TeX:] $$i$$ indexes the [TeX:] $$i^{th}$$ max pooling layer, and [TeX:] $$j$$ indexes the [TeX:] $$j^{th}$$ convolutional layer after the [TeX:] $$i^{th}$$ max pooling layer. [TeX:] $$W_{i,j}$$ and [TeX:] $$H_{i, j}$$ are the width and height of the extracted feature maps. We set [TeX:] $$i$$ to 4 and [TeX:] $$j$$ to 1.
Reconstruction loss: Reconstruction loss constrains the [TeX:] $$L_{1}$$ distance between the generated images and high-quality normal-light images, helping to drive the network to generate more realistic images.
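A form consistent with these symbols, following the self feature preserving loss of [9] and writing the generator output as [TeX:] $$G(I^{L})$$, is given below. It is a reconstruction, and the name [TeX:] $$L_{P}$$ and the pixel indices [TeX:] $$x$$ and [TeX:] $$y$$ are notational assumptions.

[TeX:] $$L_{P}\left(I^{L}\right)=\frac{1}{W_{i, j} H_{i, j}} \sum_{x=1}^{W_{i, j}} \sum_{y=1}^{H_{i, j}}\left(\Phi_{i, j}\left(I^{L}\right)_{x, y}-\Phi_{i, j}\left(G\left(I^{L}\right)\right)_{x, y}\right)^{2}$$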
The overall loss function for training our architecture is shown below:
2.4 Denoising-Scaling Module
To remove the noise amplified [15] by the preceding modules, we propose the DSM. Specifically, we adopt CBDNet [16], an efficient denoising approach. CBDNet contains two subnetworks, [TeX:] $$\text{CNN}_{\text{E}}$$ and [TeX:] $$\text{CNN}_{\text{D}}$$, which estimate the noise level map and perform non-blind denoising, respectively. We first feed the enhanced images produced by LLIEN into [TeX:] $$\text{CNN}_{\text{E}}$$ to obtain an estimated noise level map, then take both the enhanced image and the noise level map as the input of [TeX:] $$\text{CNN}_{\text{D}}$$. In [TeX:] $$\text{CNN}_{\text{D}}$$, residual learning is adopted to generate the final noise-removed enhanced images. Next, because LLIEN operates at twice the original resolution, we down-sample the denoised images to half their scale, which restores the size of the original input. Ultimately, we obtain the final enhanced images, clear and bright versions of the input low-light images.
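The final rescaling step can be sketched as follows. The down-sampling operator is not specified here, so this sketch assumes 2x2 average pooling on a single-channel image stored as a list of rows. The function name is hypothetical.

```python
def downsample_by_two(image):
    """Halve height and width by averaging each non-overlapping 2x2 block."""
    height, width = len(image) // 2, len(image[0]) // 2
    return [
        [
            (image[2 * y][2 * x] + image[2 * y][2 * x + 1]
             + image[2 * y + 1][2 * x] + image[2 * y + 1][2 * x + 1]) / 4.0
            for x in range(width)
        ]
        for y in range(height)
    ]
```

Applied to the denoised output at twice the input resolution, this returns an image with the same height and width as the original low-light input.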
3. Performance Analysis
3.1 Datasets and Implementation Details
Because GAN-based networks can be trained with unpaired low-/normal-light images, we train our model on a large-scale unpaired training set. The low-light images are collected from the Exclusively Dark dataset [17], and the normal-light images are collected from public datasets [18] and [19]. We investigate the performance of the proposed architecture on classic public datasets. We conduct all experiments with the PyTorch framework on GTX 1080Ti GPUs. Our framework builds on the training details and model structures of EnlightenGAN [9]. We fix the learning rate at 1e-4 for the first 100 epochs and then decrease it exponentially toward 0. We use the Adam optimizer and set the batch size to 16.
3.2 Ablation Study
To investigate the effectiveness of the proposed method, we perform an ablation study in this section. As shown in Fig. 3, applying super-resolution to the input image helps preserve details in local dark regions, which suggests the critical role of the SRM in generating high-quality images. Compared with the results obtained without the SRM, the contrast and illumination of the local dark regions are improved to a great extent. Additionally, texture details and color are recovered vividly with the SRM.
Ablation study for investigating the contribution of the super-resolution module (SRM). Panels from left to right show input underexposed images, the results without SRM, and the results with SRM sequentially.
3.3 Comparison with State-of-the-Arts
In this section, we present comparisons of our architecture with recent competing approaches through qualitative analysis, quantitative analysis, and a user study.
3.3.1 Qualitative analysis
We compare the visual quality of the images enhanced by our approach with that of four other low-light enhancement methods: DSLR [6], LLNet [4], RRDNet [20], and KinD++ [3]. Fig. 4 shows representative qualitative results for visual comparison.
A visual comparison of our approach with five competitive methods.
In the first example, our proposed framework successfully reconstructs texture details while avoiding overexposure in local regions, whereas DSLR [6] and LLNet [4] introduce overexposure, and KinD++ [3] introduces unnatural artifacts. In the second example, our proposed framework enhances brightness to a large extent, while the other methods cannot enhance lightness sufficiently. In the third example, our method enhances the image with natural color consistent with human visual preference, while the other approaches introduce a color cast. Based on the observations from Fig. 4, the key advantages of our approach are as follows. (1) Our approach preserves details well in dark regions and generates high-resolution images. (2) Our approach can lighten both the holistic image and underexposed local regions while avoiding over-/under-exposure artifacts.
3.3.2 Quantitative analysis
As our unsupervised method does not need ground-truth images during training, we evaluate the enhanced results of the proposed network and the other approaches [3,4,6,20,22] with the natural image quality evaluator (NIQE) [21], a no-reference image quality metric. A lower NIQE indicates better visual quality. As reported in Table 1, where bold font indicates the best performance in each test, our method achieves the lowest NIQE value on four publicly available image sets, indicating that our enhanced images are of high aesthetic quality.
Quantitative comparison between six architectures
3.3.3 User study
This section compares the six competing methods through a user study. We invite thirty volunteers to rank the quality of enhanced images selected manually from classical test sets. The volunteers consider the following aspects: over-/under-exposure of local areas, color bias, and detail recovery. Fig. 5 shows the distribution of votes from Rank 1 to Rank 6, indicating that our approach achieves the best visual quality.
Quantitative result of rating distribution for six different algorithms.
4. Conclusion
This paper proposes a GAN-based local lightness-aware enhancement network for underexposed images that restores lightness and details in local dark areas while enhancing global lightness and recovering color. The key components include the SRM, the GAN-based LLIEN, and the DSM. We apply a super-resolution strategy to low-light images so that the subsequent enhancement is performed in high-resolution space, which preserves texture details in local dark areas. Next, we present a local lightness attention module that directs more attention to local dark regions. Benefiting from multiple discriminators, the LLIEN comprehensively discriminates the generated images from the perspectives of global lightness, local lightness, and color. Specifically, LLIEN enhances the global and local lightness and controls color deviation while avoiding over-/under-exposure artifacts in the absence of paired datasets, guiding the network to generate images that conform to human visual preference. Finally, the DSM suppresses noise and produces high-quality enhanced images. Both qualitative and quantitative experiments indicate the effectiveness and generalization of our method.