Gradient-Guided Residual Learning for Inverse Halftoning and Image Expanding

Inverse halftoning and image expanding refer to the problems of restoring the pixel values of images from compressed images of smaller bit depth. Since these two problems are ill-posed, there are few perfect solutions. Recently, deep convolutional neural networks (DCNN) have shown a powerful ability in inverse halftoning and image expanding. However, the restored images still suffer from visual artifacts or loss of fine details due to improper network design. To this end, this paper proposes a residual learning model for inverse halftoning and image expanding. The whole model consists of two progressive stages. The first stage is a gradient-guided DCNN, which coarsely recovers the main content of the image with the guidance of the predicted gradients. The second stage is a residual network, which learns the residual maps to fine-tune the coarse images, leading to better local detail representation. Extensive experiments, including visual quality and numerical evaluation, are performed on the COCO data set. Results show that our method achieves the best performance when compared to the state-of-the-art methods.


I. INTRODUCTION
Bit depth refers to the number of bits that each pixel of an image consumes, and it largely determines image quality. In daily use, the most common bit depth of a color image is 24: each pixel consumes 24 bits, 8 for each channel. However, when images are processed in resource-constrained environments, the bit depth must be reduced to fit the memory, bandwidth, or accuracy requirements of the environment. For example, when an image is printed, its bit depth must be reduced to a small number (e.g., 1 bit) because of the low accuracy level of printers. In the extreme case, the operation that reduces the bit depth to 1 is called halftoning. Unfortunately, bit depth compression usually incurs information loss, quantization noise, contouring artifacts, and blocking artifacts. With the widespread application of bit depth compression, how to restore high-quality images from images of small bit depth remains an open problem. To this end, researchers have proposed many methods to restore high-quality images from images of 1-bit depth (inverse halftoning) or images of a few bits of depth (image expanding).

The associate editor coordinating the review of this manuscript and approving it for publication was Po Yang.
Inverse halftoning and image expanding are multi-modal problems without unique solutions. Even if the method used to compress the bit depth is known, multiple images may compress to the same input image. Because bit depth compression methods (e.g., halftoning) diffuse the quantization error of a pixel to its neighboring pixels, most methods infer the value of each pixel from the features contained in its neighborhood, including filtering [1]-[4], projection onto convex sets (POCS) [5]-[7], maximum a-posteriori (MAP) estimation [8], [9], wavelets [10], [11], look-up tables (LUT) [12]-[14] and neural networks [15], [16]. However, these methods either fail to suppress the artifacts caused by quantization or fail to recover the image details.
Hou and Qiu [21] and Xiao et al. [22] demonstrated that a deep convolutional neural network (DCNN) model with shortcut connections can give a good solution to inverse halftoning and image expanding. In the basic conference version of this paper [23], we further proposed a gradient-guided model which infers gradient maps as guidance information for the subsequent restoration process. The predicted gradient information is shown to better guide the recovery of fine image details in the learning process. However, there is still room for improving the fine image details.
Inspired by recent image super-resolution methods [19], which further enhance the fine image details via a residual learning scheme, we propose a residual learning method for inverse halftoning and image expanding. Our method consists of two progressive stages, the coarse restoring stage and the residual learning stage. In the coarse restoring stage, we recover the main content of the image using the gradient-guided DCNN model proposed in the basic version [23]. In the residual learning stage, we learn the residual image, i.e., the difference between the ground truth and the result of the coarse restoring stage, using an extra residual network to fine-tune the fine image details. The final image is the sum of the main-content image and the residual image. We conduct extensive experiments on the COCO data set, including visual quality and numerical evaluation. Results show that our method achieves the best performance when compared to the state-of-the-art methods. In particular, the fine details of the images are better restored and the PSNRs are higher with our method.
The basic method described in this paper was originally published in [23]. This extended version makes several changes and adds several new contributions. The main changes are as follows: (1) A novel two-stage model with an additional residual learning stage is proposed in Section III-A, which further improves the image quality; (2) More implementation details are presented in Section III-A; (3) New loss functions and a new training scheme are presented in Section III-B; (4) New results using the improved model, further analysis, and an additional comparison not previously reported (new tests on edge-enhanced halftoning data) are presented in Sections IV-C and IV-D.
The contributions of this paper are as follows: 1) We introduce a gradient-guided block to better guide the recovery of images; 2) We introduce the residual learning model to further enhance the fine image details; 3) We apply our model to inverse halftoning and image expanding and achieve state-of-the-art performance in both visual and numerical evaluation.

A. HALFTONING AND INVERSE HALFTONING
Halftoning [24]-[31] is a technology that compresses images of large bit depth to images of 1-bit depth (binary images). It converts an image to a series of black and white dots, varying either in dot size or in spatial frequency, so that the average value of a dot region is similar to that in the original image. Halftoning is widely used in image compression, output printing, and LED display to fit the accuracy requirements of these environments. Inverse halftoning is the inverse process of halftoning, which recovers the continuous-tone image from binary images.
The earliest methods are based on low-pass filters, such as the Gaussian filter [32], to remove the noise patterns. Kite et al. [4] proposed a multi-scale method to estimate the image gradients and design an adaptive low-pass filter to better recover the image. Although the high-frequency noise is removed from the image, the image details are also removed. Exploiting the multi-band frequency decomposition of the wavelet transform, [10] proposed recovering the image by selectively choosing appropriate information from suitable bands. Neelamani and Baraniuk [11] also proposed wavelet-based inverse halftoning via deconvolution (WInHD) to perform inverse halftoning for error-diffused halftones. Mese et al. constructed a look-up table (LUT) by gathering the histogram from a few halftone-original image pairs [12]. Wen et al. further proposed to optimize the table using the genetic algorithm [15]. Since no filtering is required, the LUT-based methods are very fast. However, their performance depends greatly on the choice of table, and the quality of the restored images is still not satisfactory.
Huang et al. proposed the first neural network based method [16] to handle halftoning and inverse halftoning simultaneously. Jimenez et al. proposed a multi-layer perceptron (MLP) network based inverse halftoning method [33]. Before the network training, the halftone images are smoothed with a low-pass filter. Each pixel and its 4 × 4 neighbors in the filtered image are then fed to the network to train a two-layer perceptron. In fact, this MLP-based method [33] is equivalent to a two-layer convolutional neural network. A more recent work [21] and our previous work [22], [23] used deeper convolutional neural networks, resulting in much better image quality. However, the fine details of the restored images still require further improvement.
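The low-pass filtering idea behind the earliest inverse halftoning methods can be illustrated with a minimal sketch. It is not the filter of [32] or [4]: the 3 × 3 binomial kernel standing in for a Gaussian, the clamped border handling, and the name `lowpass_inverse_halftone` are all assumptions made for illustration.

```python
# Illustrative low-pass inverse halftoning baseline (not the paper's method).
# A 3x3 binomial kernel stands in for a Gaussian filter; border pixels are
# handled by clamping coordinates to the image (an assumption of this sketch).

KERNEL = [[1, 2, 1],
          [2, 4, 2],
          [1, 2, 1]]  # weights sum to 16


def lowpass_inverse_halftone(halftone):
    """Smooth a binary image (list of rows of 0/1) into values in [0, 1]."""
    height, width = len(halftone), len(halftone[0])
    restored = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            acc = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy = min(max(y + dy, 0), height - 1)
                    xx = min(max(x + dx, 0), width - 1)
                    acc += KERNEL[dy + 1][dx + 1] * halftone[yy][xx]
            restored[y][x] = acc / 16.0
    return restored


# A checkerboard is a typical halftone pattern for a flat 50% gray patch.
checkerboard = [[(x + y) % 2 for x in range(6)] for y in range(6)]
smoothed = lowpass_inverse_halftone(checkerboard)
```

In the interior of the checkerboard the filter returns the flat gray level 0.5, which shows how smoothing removes the dot pattern. The same averaging would also blur any genuine edge, which is the loss of detail noted above.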

B. IMAGE COMPANDING
Image companding [34]-[36] is the combination of image compressing and image expanding. This technique squeezes images of large bit depth to images of small bit depth for tasks such as transmission, and restores the bit depth after the task. In early works [37], [38], image companding is implemented with nonlinear mappings, also called "global" tone-mapping methods. Li et al. [36] proposed a multi-scale subband architecture for bit depth compression. They also pointed out that the bit depth can be approximately restored using a scheme similar to that used in the compression. Hou and Qiu [21] formulated the mapping between images of small bit depth and images of large bit depth with a convolutional neural network (CNN). This method achieved good image quality, but the restored images still lack fine details, such as textures.

III. PROPOSED METHOD
The basic version described in this paper was originally published in [23]. In this work, we propose an extended version with a residual learning process to enhance detailed textures for the two image processing problems, i.e., inverse halftoning and image expanding. We illustrate the details of the proposed progressive framework in Section III-A, and then define the loss function of the network in Section III-B.
Since we use the same network model to perform inverse halftoning and image expanding (only the training data are different), we will only describe our model using the example of inverse halftoning in the following subsections.

A. RESIDUAL LEARNING FRAMEWORK
As shown in Fig. 1, our network model consists of two major stages, the coarse restoring stage for the main content recovery and the residual learning stage for detail enhancement. The final detailed output is obtained by overlaying the residual map onto the coarse image. In this section, we first introduce the gradient-guided network and then the residual network. Fig. 2 gives example outputs of the two parts of our progressive residual framework.

1) COARSE RECOVERING STAGE
As shown in [21], [22], images restored by a single DCNN model suffer from loss of fine image details. To better train the network to form a good solution, we proposed a gradient-guided network module, where we first predict the gradient maps from the input image, and then use them to guide the coarse recovering.
The gradient-guided network consists of 3 subnetworks $T_1$, $T_2$ and $T_3$, which share the same U-shaped architecture with skip connections between the encoder and decoder. In the encoder, we employ an average pooling layer with 2 × 2 filters to down-sample the feature maps after each convolution layer. Assuming the input of a subnetwork is $I_{in} \in \mathbb{R}^{H \times W \times 1}$, each encoder level $i \in \{1, 2, 3\}$ applies a convolution followed by the activation function to produce the feature maps $F^i_{en}$, and then average pooling to produce the corresponding down-sampling result $D^i_{en}$. Here $W$ and $b$ stand for the weight and bias of the convolution operation, $*$ represents the convolution operation, $\mathrm{avgpool}$ is the average pooling operation, and $\sigma_{relu}$ is the element-wise rectified linear unit (ReLU) activation function.
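The encoder step described above (convolution, ReLU, then 2 × 2 average pooling) can be sketched as follows. This is a simplified illustration rather than the network's implementation: the convolution is reduced to a single-channel 1 × 1 kernel (a scalar weight and bias), and the function names are hypothetical.

```python
def relu(feature_map):
    """Element-wise rectified linear unit."""
    return [[max(0.0, v) for v in row] for row in feature_map]


def avg_pool_2x2(feature_map):
    """2x2 average pooling with stride 2 (height and width must be even)."""
    height, width = len(feature_map), len(feature_map[0])
    return [
        [
            (feature_map[y][x] + feature_map[y][x + 1]
             + feature_map[y + 1][x] + feature_map[y + 1][x + 1]) / 4.0
            for x in range(0, width, 2)
        ]
        for y in range(0, height, 2)
    ]


def encoder_level(inputs, weight, bias):
    """One encoder level: conv (1x1 stand-in) -> ReLU -> average pooling.

    Returns the feature maps and their down-sampled version.
    """
    conv = [[weight * v + bias for v in row] for row in inputs]
    features = relu(conv)
    return features, avg_pool_2x2(features)


inputs = [[1, -1, 2, 2],
          [3, -3, 2, 2],
          [0, 0, -4, 4],
          [0, 0, 4, -4]]
features, downsampled = encoder_level(inputs, weight=1.0, bias=0.0)
```

Each level halves the spatial resolution, so three levels reduce an H × W input to H/8 × W/8 before the decoder up-samples it again.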
The decoder of each subnetwork employs deconvolution layers for up-sampling. Since each deconvolution layer halves the number of feature channels, we add an extra 1 × 1 convolution layer to increase the number of feature maps. The decoder operations are defined for $i \in \{1, 2, 3\}$ in terms of the deconvolution operation and the concatenation operation $\oplus$. Finally, a 1 × 1 convolution layer is used to obtain the output image. Batch normalization is used in all convolution and deconvolution layers (except the input and output layers).
Specifically, given a halftone image $I_h \in \mathbb{R}^{H \times W \times 1}$, $T_1$ and $T_2$ learn end-to-end nonlinear mappings from the input halftone image to its gradient maps in the horizontal and vertical directions, respectively. The learned gradient maps then serve as auxiliary information that guides $T_3$ in producing a coarse restoration with limited details. In this procedure, $I_c \in \mathbb{R}^{H \times W \times 1}$ is the coarse image generated by our basic version [23], $I_{hx}, I_{hy} \in \mathbb{R}^{H \times W \times 1}$ are the horizontal and vertical gradient maps inferred by $T_1$ and $T_2$, respectively, and $\theta_1$, $\theta_2$ and $\theta_3$ represent the learned network parameters of $T_1$, $T_2$ and $T_3$, respectively.
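The coarse recovering procedure can be written compactly using the symbols defined above. The following is a sketch consistent with those definitions; feeding $T_3$ the concatenation of the halftone image and both predicted gradient maps is an assumption about how the guidance is supplied.

```latex
\begin{aligned}
I_{hx} &= T_1(I_h;\ \theta_1), \\
I_{hy} &= T_2(I_h;\ \theta_2), \\
I_c    &= T_3(I_h \oplus I_{hx} \oplus I_{hy};\ \theta_3).
\end{aligned}
```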

2) RESIDUAL LEARNING STAGE
To further exploit subtle textures, we learn a residual map that reduces the differences between the coarse restoration and the ground truth. Since the main content of the image is almost restored, we only need to fine-tune the local textures that are far from the ground truth. The residual block proposed in [17] features an identity mapping, which enables the network to adjust detail representations selectively. Similar ideas also appear in image super-resolution [18], [19]. As shown in Fig. 1, the main components of the residual learning stage are 8 residual blocks, and the whole residual network contains no down-sampling operation. The input of the residual network, $I_d \in \mathbb{R}^{H \times W \times 2}$, is a concatenation of the halftone image $I_h$ and the coarse image $I_c$. First, we use a convolution layer with 3 × 3 filters to expand the feature dimension, followed by 8 sequential residual blocks that extract structural details in an incremental way. In the residual blocks, $i \in \{1, 2, \ldots, 8\}$ denotes the $i$-th residual block and $F^i_j$ denotes the feature map of the $j$-th convolution layer in the $i$-th residual block. For the case of $i = 0$, $F^0_3$ is $F^0$ in equation (10). After two more convolution layers, the detail enhancement network outputs the corresponding residual map $I_{res} \in \mathbb{R}^{H \times W \times 1}$. All convolution kernels are 3 × 3, except for the final convolution layer, which is 1 × 1. The final output of our framework, $I_o \in \mathbb{R}^{H \times W \times 1}$, is then obtained by combining the coarse image $I_c$ with the residual map produced by $F(I_h, I_c; \theta_4)$, where $F$ denotes the residual network, $\theta_4$ denotes the learned parameters of $F$, and $\sigma_{\tanh}$ is the Tanh activation function used in this formulation.
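The two additive ideas in this stage, the identity shortcut inside a residual block and the final sum of coarse image and residual map, can be sketched as follows. This is a minimal illustration under assumptions: the convolution layers are replaced by a hypothetical stand-in `zero_transform`, and the Tanh activation mentioned in the text is not modelled.

```python
def add_maps(a, b):
    """Element-wise sum of two equally sized 2-D maps."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def residual_block(features, transform):
    """Identity shortcut: output = features + transform(features)."""
    return add_maps(features, transform(features))


def refine(coarse, residual_map):
    """Final image as the coarse image plus the learned residual map."""
    return add_maps(coarse, residual_map)


def zero_transform(features):
    # Hypothetical stand-in for the convolution layers inside a block.
    return [[0.0 for _ in row] for row in features]


features = [[0.2, 0.4], [0.6, 0.8]]
unchanged = residual_block(features, zero_transform)
final = refine([[0.5, 0.5]], [[0.25, -0.25]])
```

When the learned transform outputs zeros, the block passes its input through unchanged. This is why the residual network can leave well-restored regions alone and adjust only the pixels that still differ from the ground truth.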

B. NETWORK TRAINING AND LOSS FUNCTION
Our method is progressive. Given a halftone image, a coarse image is first generated by the gradient-guided network [23], and an improved result is then obtained by the residual network proposed in this paper. Training both networks from scratch at the same time is difficult, so our training procedure comprises two stages, each with a different loss function. The goal of the first training stage is to obtain a residual network that can enhance details to some extent. We use the well-trained gradient-guided network from [23] and keep the parameters $\theta_1$, $\theta_2$ and $\theta_3$ unchanged. Only the residual network $F(I_h, I_c; \theta_4)$ is trained in this stage, with an $L_1$ loss. The first-stage loss measures the $L_1$ difference between the final output and the ground truth image $I_Y$ over all pixel positions $(i, j)$, where $H$ and $W$ are the height and width of the image.
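A pixel-wise L1 loss of the kind used in the first training stage can be sketched as follows. Averaging over the H × W pixels is an assumption of this sketch (a plain sum would differ only by a constant factor), and `l1_loss` is a hypothetical name.

```python
def l1_loss(output, target):
    """Mean absolute difference between two H x W images."""
    height, width = len(output), len(output[0])
    total = sum(
        abs(o - t)
        for out_row, tgt_row in zip(output, target)
        for o, t in zip(out_row, tgt_row)
    )
    return total / (height * width)


restored = [[0.0, 1.0], [0.5, 0.5]]
ground_truth = [[0.0, 0.0], [1.0, 0.5]]
loss = l1_loss(restored, ground_truth)
```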
To make the two separate networks collaborate more seamlessly, we jointly fine-tune the parameters of both networks and provide supervision for the outputs of both networks during the second training stage. To further emphasize the extraction of detail information, we add a comparison of gradient maps to the loss function. The loss function of this stage is a weighted combination of these terms, where $S(\cdot)$ denotes the Sobel operator, which extracts the edge map of a given input image, and $\alpha$, $\beta$ and $\gamma$ are hyper-parameters that weight the relative importance of the different losses. In this paper, we set $\alpha = 1.0$, $\beta = 1.5$, $\gamma = 10.0$ empirically. It is worth noting that joint training reduces the quality of the coarse image, but the final result becomes better.
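The gradient comparison term can be illustrated with a small Sobel sketch. Several details are assumptions made for illustration: the edge map is taken as |Gx| + |Gy|, borders are handled by clamping coordinates, and the difference between edge maps is averaged over all pixels. How the three weighted terms are paired with α, β and γ is not modelled here.

```python
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def sobel_edge_map(image):
    """Edge strength |Gx| + |Gy| per pixel, with clamped borders."""
    height, width = len(image), len(image[0])
    edges = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            gx = gy = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy = min(max(y + dy, 0), height - 1)
                    xx = min(max(x + dx, 0), width - 1)
                    gx += SOBEL_X[dy + 1][dx + 1] * image[yy][xx]
                    gy += SOBEL_Y[dy + 1][dx + 1] * image[yy][xx]
            edges[y][x] = abs(gx) + abs(gy)
    return edges


def gradient_l1(output, target):
    """Mean absolute difference between the Sobel edge maps of two images."""
    e_out, e_tgt = sobel_edge_map(output), sobel_edge_map(target)
    height, width = len(output), len(output[0])
    total = sum(
        abs(a - b)
        for row_a, row_b in zip(e_out, e_tgt)
        for a, b in zip(row_a, row_b)
    )
    return total / (height * width)


step = [[0, 0, 1, 1] for _ in range(4)]   # vertical step edge
flat = [[0, 0, 0, 0] for _ in range(4)]   # no edges at all
step_edges = sobel_edge_map(step)
```

A restoration that blurs an edge produces a weaker edge map than the ground truth, so this term penalises exactly the kind of detail loss the second stage is meant to correct.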

IV. EXPERIMENTS
We now present the qualitative and quantitative experiments of our method in detail. In Section IV-A, we introduce our experimental settings, including the data set and detailed training parameters. Section IV-B studies the impact of the loss parameters. We then evaluate the inverse halftoning and image expanding tasks in Sections IV-C and IV-D, respectively.

A. EXPERIMENT SETTINGS
The data set we use for network training and testing is Microsoft COCO [39], a large-scale database containing more than 80,000 images for training and more than 40,000 images for testing. For training, the training images are resized to 256 × 256. For testing, 2000 images are randomly selected from the testing images as our testing set. In addition, 6 frequently used images, Baboon, Barbara, Boat, Goldhill, Lena and Peppers, are added to our testing set. Our network model is implemented using Google's TensorFlow framework and trained on a GeForce GTX1080 Ti GPU. In the training, we first train subnetworks $T_1$ and $T_2$ for gradient prediction [23]. Then, we train subnetwork $T_3$ for a coarse recovery. Thereafter, we fix $T_1$, $T_2$, $T_3$ and train the residual blocks. Finally, $T_1$, $T_2$, $T_3$ and the residual blocks are further fine-tuned. For each subnetwork, the Adam algorithm [40] is chosen as the optimizer. Based on experiments, we select a batch size of 16 and 200,000 training iterations. The initial learning rate is set to 0.001 in the first training stage and 0.0001 in the second stage. The learning rate is multiplied by 0.7 every 20,000 steps. A diminishing learning rate has been shown to improve the training effectiveness of the model.
We also compare our method with several recent methods, including LUT [14], MLP [33], Hou's DCNN method [21] and our basic version [23]. The LUT [14] and MLP [33] methods are implemented in Matlab. Hou's method [21] is implemented in TensorFlow with Python. All these implementations achieve nearly identical performance to that presented in their papers. Peak signal-to-noise ratio (PSNR) is used as the metric to numerically evaluate the results of inverse halftoning and image expanding.
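The learning-rate schedule described above can be sketched as a step-wise decay. Treating the decay as a staircase (the rate changes only at multiples of 20,000 steps) is an assumption, and `learning_rate` is a hypothetical helper rather than a TensorFlow call.

```python
def learning_rate(step, initial_rate, decay=0.7, decay_steps=20000):
    """Staircase decay: multiply the rate by `decay` every `decay_steps` steps."""
    return initial_rate * decay ** (step // decay_steps)


# First training stage starts at 0.001, second stage at 0.0001.
stage_one_start = learning_rate(0, 0.001)
stage_one_after_two_decays = learning_rate(40000, 0.001)
stage_two_final = learning_rate(200000, 0.0001)
```

Under this reading, the rate has been multiplied by 0.7 ten times by step 200,000, so it ends at roughly 2.8% of its initial value in each stage.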
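For reference, PSNR can be computed as in the sketch below. It uses the standard definition; the peak value of 255 assumes 8-bit images, and the function name `psnr` is chosen for illustration.

```python
import math


def psnr(restored, reference, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    count = len(reference) * len(reference[0])
    mse = sum(
        (r - g) ** 2
        for res_row, ref_row in zip(restored, reference)
        for r, g in zip(res_row, ref_row)
    ) / count
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak * peak / mse)


reference = [[10, 20], [30, 40]]
off_by_one = [[11, 21], [31, 41]]
score = psnr(off_by_one, reference)
```

Because the scale is logarithmic, halving the mean squared error raises the PSNR by about 3 dB, which is useful context for the sub-decibel gains reported in this section.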

B. STUDY ON LOSS PARAMETERS
We first study the impact of the loss parameters in the second stage. We test different ratios of α and β, and fix the weighting factor γ = 10.0, since it only brings the value of the gradient difference close to that of the other losses. The results are shown in Table 1. As we can see, since the parameters here are only used to fine-tune the whole network after each sub-network has been trained separately, their impact is insignificant. In general, α = 1.0, β = 1.5, γ = 10.0 works slightly better than other ratios. Therefore, we set α = 1.0, β = 1.5, γ = 10.0 in our experiments.

C. INVERSE HALFTONING
To train our network for inverse halftoning, Floyd-Steinberg error diffusion (FS) [25] and its edge-enhanced version (E-FS) [28] are used to generate halftone images for training and testing. We experiment with both grayscale and color images using the same approach. For color images, each channel is processed individually as a grayscale image. We test our method on the 2K testing set at sizes 256 × 256 and 512 × 512, as well as on the 6 classical images at size 512 × 512.

As shown in Table 2, our method achieves the best numerical results for both color and grayscale images. In the second column of the table, g and c denote grayscale and color images, respectively, and the last two columns give the results on the 2K test set at sizes 256 and 512. In particular, images containing abundant regular textures improve more significantly. For example, the PSNR of Barbara (with many black and white stripes) increases by 0.76 dB over the basic version [23], while the PSNR of Peppers (with fewer textures) increases by only 0.24 dB.

As shown in the visual comparisons in Fig. 3 and Fig. 4, the proposed method recovers the images best, with fine details better restored. For example, in Fig. 3(e1), the stripes restored by our method are more distinct than those of the other methods. In the third row of Fig. 3, Hou's method [21] and the basic version [23] restore only the outline of the letter "m", and some noise patterns remain between the strokes. In contrast, our method restores more distinct strokes. We attribute this to the introduction of the residual learning stage.
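The Floyd-Steinberg error diffusion used to generate the halftone inputs can be sketched as follows. This is the textbook raster-scan form with the standard 7/16, 3/16, 5/16, 1/16 weights and a 0.5 threshold. It is not necessarily the exact implementation used for the experiments, and the edge-enhanced variant (E-FS) is not covered.

```python
def floyd_steinberg(image):
    """Halftone a grayscale image with values in [0, 1] into a 0/1 image."""
    height, width = len(image), len(image[0])
    work = [[float(v) for v in row] for row in image]  # accumulates errors
    halftone = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            old = work[y][x]
            new = 1 if old >= 0.5 else 0
            halftone[y][x] = new
            err = old - new
            # Diffuse the quantization error to unprocessed neighbours.
            if x + 1 < width:
                work[y][x + 1] += err * 7 / 16
            if y + 1 < height:
                if x > 0:
                    work[y + 1][x - 1] += err * 3 / 16
                work[y + 1][x] += err * 5 / 16
                if x + 1 < width:
                    work[y + 1][x + 1] += err * 1 / 16
    return halftone


gray = [[0.5] * 16 for _ in range(16)]
dots = floyd_steinberg(gray)
mean_dots = sum(sum(row) for row in dots) / 256.0
```

The diffusion step keeps the local average of the binary output close to the original gray level. This is the information that an inverse halftoning network can exploit, in contrast to plain quantization, which discards it.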

D. IMAGE EXPANDING
The second image processing problem we consider is image expanding. Since the most frequently used images have 8 bits per pixel per channel, we use 8-bit (per channel) images as the largest bit depth in our experiments. We first compress the 8-bit images to 2-bit and 4-bit images channel by channel, and then expand them back to 8 bits. A common method [36] for compressing high bit-depth images is to quantize the color levels from 256 to the corresponding lower number of levels. This conversion from a high bit depth to a lower one is given by the quantization formula (17), in which $P_{low}$ and $P_{high}$ denote the pixel intensities of the converted smaller and larger bit depth images, respectively, and $l$ and $h$ denote the bit depths of the small and large bit depth images.
We use 8-bit images as the ground truth and the smaller bit depth images preprocessed by (17) as inputs to train our network model. In this paper, we only use 2-bit and 4-bit depth images for experiments. As shown in Table 3, our method improves the image quality by a large margin for both color and grayscale images in terms of PSNR. As shown in the visual comparisons in Fig. 5-8, our method can also enhance the image details in image expanding, but the improvement is not as obvious as in inverse halftoning. This is because, when the bit depth is compressed using (17), much of the information in the picture disappears completely, whereas the halftoning process distributes the quantization error into the neighbourhood. Our network cannot generate image details out of thin air. However, our method can utilize the existing gradient borders to improve the color smoothness of local patches. Therefore, our method results in fewer blocking and contouring artifacts compared to Hou's method [21].
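The bit depth compression used to prepare the inputs can be illustrated with a short sketch. The exact form of formula (17) is not reproduced here. Dropping the lowest h − l bits (integer division by 2^(h−l)) is one common quantization convention and is an assumption, as are the function names and the naive zero-fill expansion shown for contrast.

```python
def compress_bit_depth(pixel, high_bits=8, low_bits=2):
    """Quantize an integer pixel by keeping only its top `low_bits` bits."""
    return pixel >> (high_bits - low_bits)


def expand_zero_fill(pixel, high_bits=8, low_bits=2):
    """Naive expansion: shift back and fill the lost bits with zeros."""
    return pixel << (high_bits - low_bits)


two_bit = [compress_bit_depth(p) for p in (0, 63, 64, 127, 255)]
four_bit = compress_bit_depth(200, low_bits=4)
naive = expand_zero_fill(compress_bit_depth(127))
```

Under this convention, the values 64 through 127 all collapse to the same 2-bit level, and zero-fill expansion maps every one of them back to 64. This many-to-one mapping is why the lost detail cannot be recovered from a single pixel and must be inferred from its surroundings.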

V. CONCLUSION
In this paper, we propose a progressive framework for inverse halftoning and image expanding. We first generate a coarse image from the given input image, and then generate a residual map to fine-tune the coarse result. This progressive procedure helps us obtain finely detailed restorations. Experimental results show that our model outperforms the state-of-the-art methods in terms of both visual quality and numerical evaluation.

CHAO He also serves as the Director of the Hunan Key Laboratory of Big Data Research and Application and the Vice Director of the Hunan Engineering Laboratory of Authentication and Data Security. His main interests are network and data security, privacy, data analytics and applications, machine learning, and applied cryptography.
YI XIAO received the bachelor's and master's degrees in mathematics from Sichuan University, in 2005 and 2008, respectively, and the Ph.D. degree in electronic engineering from the City University of Hong Kong, in 2012. He is currently an Associate Professor with the College of Computer Science and Electronic Engineering, Hunan University, China. His research interests include computer graphics, image processing, GPU computing, and neural networks. VOLUME 8, 2020