Learning Super-Resolution of Environment Matting of Transparent Objects from a Single Image

This paper addresses the problem of super-resolution of environment matting of transparent objects. In contrast to traditional methods of environment matting of transparent objects, which often require a large number of input images or complex camera setups, recent approaches using convolutional neural networks are more practical. In particular, after training, they can generate the environment mattes using a single image. However, they still do not have super-resolution capabilities. This paper first proposes an encoder-decoder network with restoration units for super-resolution environment matting, called Enhanced Transparent Object Matting Network (ETOM-Net). Then, we introduce a refinement phase to improve the details of the output further. The ETOM-Net effectively recovers lost features in the LR input images and produces visually plausible HR environment mattes and the corresponding reconstructed images, demonstrating our method’s effectiveness.


I. INTRODUCTION
Image matting has been used in many real-life applications, such as image and video editing, or the weather map superimposed behind the meteorologist commonly seen in daily TV news. The matting process estimates an alpha matte that separates the foreground object from the background so that the object can be placed on a new background, which is how the film industry creates special effects. The image matting model [1] is defined as follows:

C = αF + (1 − α)B,  (1)

where C denotes the composited pixel value, and F and B denote the foreground and background pixel values, respectively. α denotes the opacity, indicating the degree of blending between the foreground and the background. As Eq. 1 shows, this model can only handle non-transparent objects because it does not take the optical properties of transparent objects, such as refraction and reflection, into account.
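As an illustration, Eq. 1 can be sketched in a few lines of numpy (the image sizes and pixel values below are arbitrary, chosen only to show the per-pixel blending):

```python
import numpy as np

# Toy 2x2 RGB example of the classic matting equation C = aF + (1 - a)B.
F = np.full((2, 2, 3), 1.0)                 # white foreground
B = np.zeros((2, 2, 3))                     # black background
alpha = np.array([[1.0, 0.5],
                  [0.0, 0.25]])[..., None]  # per-pixel opacity, broadcast over RGB

C = alpha * F + (1.0 - alpha) * B           # composited image
print(C[0, 1])                              # half-blended pixel: [0.5 0.5 0.5]
```

Note that α acts per pixel, so a single matte encodes both hard boundaries (α ∈ {0, 1}) and soft edges (0 < α < 1).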
To address this limitation, Zongker et al. [23] introduce environment matting to capture how light in the environment is refracted and reflected by foreground objects. The new model has the following form:

C = F + (1 − α)B + Φ,  (2)

where Φ represents the contribution of light from the environment that is reflected or refracted by the surface of the foreground object. After the work of Zongker et al. [23], many approaches [24]-[28] have been proposed to improve their method, but they are still limited by the large number of input images or complex capture settings required. Inspired by the performance of convolutional neural networks in high-level computer vision tasks, Chen et al. [29] propose a CNN-based approach, called Transparent Object Matting Network (TOM-Net), which learns the environment matte from a single input image and is effective and efficient compared to earlier works.
Considering that the input images used are usually of low quality, it is natural to combine super-resolution capabilities with environment matting. The process of reconstructing an HR image from a single LR image is called Single Image Super-Resolution (SISR). While a large number of off-the-shelf methods are available, simply using an SISR method to super-resolve an environment matte does not produce plausible results; in particular, refractive flows cannot be super-resolved with existing methods. Thus, in our work, we focus on super-resolution of environment matting of transparent objects from a single image. In this paper, we propose a new network called ETOM-Net with three restoration units for super-resolution environment matting. The network effectively recovers lost features in LR input images and produces visually plausible HR environment mattes and synthesized images, as shown in Fig. 1. In addition to the main phase, we incorporate a refinement phase with residual learning to improve the quality of the HR environment matte and the reconstructed image.

II. RELATED WORK

A. SINGLE IMAGE SUPER-RESOLUTION
Super-resolution is the process of generating a high-resolution image from a low-resolution or degraded image. It can be divided into two categories depending on how many images are used as input: SISR and Multi-Image Super-Resolution (MISR). SISR is challenging but more practical in real-world applications, and it is therefore the focus of much recent research.
Dong et al. first introduce a neural network model into SISR, called SRCNN [13]. This method uses a three-layer CNN to perform patch extraction and representation, nonlinear mapping and reconstruction. As a milestone, it has a lightweight structure that achieves not only speed for practical use but also better performance than the state-of-the-art conventional methods. FSRCNN [14] is an improvement of SRCNN. Instead of scaling up the LR input at the beginning as SRCNN does, it processes the LR image directly and applies a deconvolution layer at the end to scale the results to the correct size. It has a faster training speed but still maintains a good performance.
Later, ResNet [15], proposed by He et al., incorporates a residual learning framework to ease the training of very deep networks. Since the introduction of ResNet, researchers have explored the use of residual learning in SISR. In VDSR [16], Kim et al. present a very deep convolutional network with global residual learning that achieves a significant improvement in accuracy. DRCN [18] increases the network depth through recursive layers without adding parameters, and [20] simplifies the residual block design, which makes it possible to stack more residual blocks under the same conditions. RCAN [21] uses a residual-in-residual structure to form a very deep network and a channel attention mechanism to adaptively rescale channel-wise features, achieving highly accurate SISR.
Later on, Cheng et al. combine an encoder-decoder network with a residual-in-residual structure that includes several residual channel-wise attention blocks inspired by RCAN, named EDRN [22]. EDRN adopts a coarse-to-fine structure, which can gradually recover lost information and reduce the impact of noise. They also use batch normalization for real-world SISR, even though it has been shown to be inefficient for SISR on synthetic datasets; their results show that applying BN to the downsampling and upsampling convolutional layers yields a performance improvement without a significant increase in execution time. EDRN can effectively restore HR images from real-world LR images and was one of the best methods in the NTIRE 2019 Real SR Challenge.

B. ENVIRONMENT MATTING
Environment matting, first introduced by Zongker et al. [23], captures not only a foreground object and how light is attenuated passing through it, but also how the object refracts and reflects light from the scene. The foreground object can then be composited into a new environment with physically correct reflection and refraction effects. They use three monitors to display a series of magenta and green stripe patterns and a digital camera to capture the scene, obtaining the environment matte by identifying the background area corresponding to each foreground pixel.
Chuang et al. [24] further extend the original environment matting in two distinct directions. The first is to utilize more backdrops to capture complex and subtle object refraction and reflection. The second is to obtain a simplified matte using only one image, which allows them to achieve a real-time environment matting of colourless objects in motion.
Both methods assume that some region of the background maps to each foreground pixel in the image. However, Wexler et al. [25] argue that a probabilistic approach, which assumes that each background pixel has a probability of contributing to the colour of some foreground pixel and requires no complex calibration setup, is a better choice. A limitation of their method is that diffuse scattering affects the estimation of the probability densities.
Peers et al. [26] use a series of wavelet patterns to obtain the environment matte of a scene while capturing the effect of diffuse reflections, which is not possible with previous methods, and reduce post-processing time by linearly combining a large number of basis images.
Zhu and Yang [27] introduce a frequency-based environment matting method. Instead of analyzing images in the time domain, their method uses Fourier analysis to analyze the data in the frequency domain, which can obtain a more physically correct result at the expense of requiring many images. Later, Qian et al. [28] incorporate compressive sensing theory into frequency-based environment matting, achieving higher performance with far fewer images.
TOM-Net [29] is a CNN-based environment matting approach proposed by Chen et al. They design a deep learning framework to learn the mapping between a single input image and the corresponding environment matte, including an object segmentation mask, an attenuation map, and a refractive flow field, by assuming that the foreground object is transparent, has no colour, and has only one mapping at each point. They can then composite a new image using the output matte and a new backdrop. They also create a large-scale synthetic dataset and a real dataset for training and testing. Their approach is effective and efficient, requiring no cumbersome capture procedures or lengthy processing times while still yielding visually pleasing results. Although Chen et al. have explored the potential of CNN-based environment matting, their method TOM-Net does not have the super-resolution capability that is needed in real-life situations, where the input images are usually of low quality. This limitation motivates our research.

III. PROBLEM FORMULATION

A. SUPER-RESOLUTION
Low-resolution images can be seen as degraded versions of high-resolution images. In general, HR and LR images are linked by the following model:

I_LR = (I_HR ⊗ k) ↓s + n,  (3)

where ⊗ k represents convolution with a blur kernel k, ↓s denotes downsampling by a scale factor s, and n denotes additive noise. Since our main focus is on synthetic data, we assume bicubic downsampling and Gaussian blur when generating LR images from the ground truth.
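The degradation model can be sketched in numpy as below. This is only a minimal illustration: we use a separable Gaussian kernel for the blur, and stride-based decimation stands in for true bicubic downsampling (an assumption made here for brevity; the function name `degrade` and its parameters are ours):

```python
import numpy as np

def degrade(hr, scale=2, sigma=1.0, noise_std=0.0, rng=None):
    """Sketch of I_LR = (I_HR conv k) downsample_s + n for a 2-D image."""
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x**2 / (2 * sigma**2))
    k /= k.sum()                              # normalized 1-D Gaussian kernel
    # Separable blur: filter rows, then columns (edges zero-padded).
    blurred = np.apply_along_axis(lambda r: np.convolve(r, k, mode='same'), 1, hr)
    blurred = np.apply_along_axis(lambda c: np.convolve(c, k, mode='same'), 0, blurred)
    lr = blurred[::scale, ::scale]            # downsampling by factor s
    if noise_std > 0:                         # optional additive noise n
        rng = rng or np.random.default_rng(0)
        lr = lr + rng.normal(0.0, noise_std, lr.shape)
    return lr

hr = np.ones((8, 8))
lr = degrade(hr, scale=2)
print(lr.shape)   # (4, 4)
```

In practice, any SISR pipeline must invert (or be robust to) all three operations: blur, downsampling, and noise.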

B. ENVIRONMENT MATTING
Following the work of [24] and [29], we first assume that the foreground object is colourless and transparent, because modelling too many optical properties would make the model too complex to yield good results. For refraction, Wexler et al. [25] assume that each background pixel has a probability of contributing to some foreground pixel, while Zongker et al. [23] assume that each foreground pixel is a linear combination of the pixel values of a region in the background. In our work, we assume that the foreground object has no reflection and that, in the single-background setting, each foreground pixel comes from a single pixel in the background.
With these assumptions, similar to [29], the transparent environment matting problem can be modelled as follows:

O = I_mask ⊙ I_rho ⊙ S(I_ref, G(I_flow)) + (1 − I_mask) ⊙ I_ref,  (4)

where O denotes the composited image; I_mask, I_rho, and I_ref denote the pixelwise mask of the foreground object, the attenuation map of the foreground object, and the background image, respectively; and ⊙ denotes element-wise multiplication. The mask I_mask ∈ {0, 1} is binary: I_mask(i, j) = 0 denotes that the pixel at (i, j) is a background pixel, and vice versa. The attenuation map I_rho ∈ [0, 1] indicates how much the object attenuates the light.
S() is a function that re-samples the background image at the pixel locations given by a flow-field grid, with the computation done by bilinear interpolation. The grid specifies the normalized sampling positions, with most values in the range [−1, 1], and is generated by the function G() from a two-channel refractive flow I_flow that represents the offset (V_x, V_y) between the composited image and its corresponding background image.
The function G() is a flow-field grid generator. It first generates a two-dimensional base grid whose values run from 0 to the image width from left to right, and from 0 to the image height from top to bottom. It then scales this base grid to [−1, 1] and adds the input refractive flow element-wise to the scaled base grid to form the flow-field grid used as the input to S().
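A minimal numpy sketch of G() and S() under the conventions described above (the function names `make_grid` and `sample` are ours; deep-learning frameworks provide equivalent grid-generation and grid-sampling primitives):

```python
import numpy as np

def make_grid(flow):
    """Sketch of G(): base grid normalized to [-1, 1] plus the
    two-channel refractive flow (Vx, Vy), added element-wise."""
    h, w = flow.shape[:2]
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing='ij')
    base = np.stack([2 * xs / (w - 1) - 1,        # x in [-1, 1]
                     2 * ys / (h - 1) - 1], -1)   # y in [-1, 1]
    return base + flow

def sample(bg, grid):
    """Sketch of S(): bilinear re-sampling of the background image at
    the normalized positions given by the flow-field grid."""
    h, w = bg.shape[:2]
    x = (grid[..., 0] + 1) / 2 * (w - 1)          # back to pixel coordinates
    y = (grid[..., 1] + 1) / 2 * (h - 1)
    x0 = np.clip(np.floor(x).astype(int), 0, w - 2)
    y0 = np.clip(np.floor(y).astype(int), 0, h - 2)
    fx, fy = x - x0, y - y0                       # fractional offsets
    return ((1 - fy) * (1 - fx) * bg[y0, x0] + (1 - fy) * fx * bg[y0, x0 + 1]
            + fy * (1 - fx) * bg[y0 + 1, x0] + fy * fx * bg[y0 + 1, x0 + 1])

bg = np.arange(16.0).reshape(4, 4)
zero_flow = np.zeros((4, 4, 2))
out = sample(bg, make_grid(zero_flow))   # zero flow reproduces the background
```

With a zero refractive flow, the grid reduces to the scaled base grid and S() returns the background unchanged, which is the expected behaviour outside the object region.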
From Eq. 4, the environment matting problem can now be solved by estimating an environment matte, which includes a pixelwise mask I_mask, an attenuation map I_rho, and a refractive flow field I_flow, from a single input image, as shown in Fig. 1. Note that I_rho and I_flow only apply to the region where I_mask = 1; outside this region, we use the corresponding background pixels as the composited pixels. The quality of I_mask therefore has a significant influence on the reconstructed image.

IV. PROPOSED METHOD

A. ARCHITECTURE
The main phase of our proposed ETOM-Net is shown in Fig. 2. Similar to [29] and [30], it contains an encoder-decoder structure with a shared encoding process and three independent decoding processes corresponding to the three output environment mattes.
In this structure, we use six encoders and eighteen decoders, inspired by the work of [29]. Every three decoders form a combination that shares the same input, which allows the three decoding processes to learn features from each other, so the three output environment mattes are more correlated.
Each encoder contains two convolutional layers, with strides of 1 and 2, and two ReLU activation layers, so the six encoders together downsample the input by a factor of 64. Each decoder has one convolutional layer, one ReLU activation layer, and one upsampling layer that recovers the resolution reduced by the corresponding encoder. Skip connections are also used to connect feature maps of the same size in the encoding and decoding processes. The encoding process can be represented as:

O_E,i = E_i(O_E,i−1),  (5)

and the output of each decoder can be formulated as:

O_D,i,j = D_i,j([O_D,i−1,j, O_E]),  (6)

where O_E,i denotes the output of the i-th encoder, O_D,i,j denotes the output of decoder (i, j), i.e., the j-th decoding process at scale i, E_i denotes the i-th encoder, D_i,j denotes decoder (i, j), and O_E denotes the output of the encoding process with the same feature dimension as O_D,i−1,j, brought in by the skip connection.

We add three new Restoration Units (RU) after the encoding process and before the decoding processes, which allows the main phase of the network to focus on the more informative parts of the LR input and enhances the discriminative power of the network. Fig. 3 shows the structure of the restoration unit. Each RU consists of four Residual In Residual Blocks (RIRB), inspired by the work of [22], and a convolutional layer, with each RIRB stacking ten Residual Blocks (RB) and one convolutional layer. The output of an RU can be formulated as:

O_RU = Conv(RIRB_3(· · · (RIRB_0(I_RU)))) + I_RU,  (7)

where O_RU denotes the output of the RU, I_RU denotes its input, and Conv and RIRB_i denote a convolutional layer and the i-th RIRB, respectively. The output of RIRB_i can be obtained by:

O_RIRB,i = Conv(RB_9(· · · (RB_0(I_RIRB,i)))) + I_RIRB,i,  (8)

where O_RIRB,i denotes the output of the i-th RIRB, I_RIRB,i denotes its input, and RB_i denotes the i-th RB.
Within each RB, we use two convolutional layers with a ReLU activation layer between them, followed by a Residual Channel-wise Attention Block (RCAB) [21] at the end. The RCAB has a global average pooling layer at the beginning and a Tanh activation layer at the end. For residual learning, the inputs of the RU, RIRB, and RB are added to their outputs, while the input of the RCAB is multiplied with its output. The formulation of RB_i can be represented as:

O_RB,i = RCAB(Conv(ReLU(Conv(I_RB,i)))) + I_RB,i,  (9)

and the output of the RCAB can be formulated as:

O_RCAB = Tanh(Conv(ReLU(Conv(Pooling(I_RCAB))))) * I_RCAB,  (10)

where O_RB,i denotes the output of the i-th RB, I_RB,i denotes its input, O_RCAB denotes the output of the RCAB, and I_RCAB denotes its input. ReLU, Pooling, and Tanh denote the rectified linear unit, average pooling, and hyperbolic tangent functions, respectively.

Inspired by [29], we train the main phase of ETOM-Net with losses at four different scales. This multi-scale loss starts with a feature map of size 64×64×64 and ends with a size of 8×512×512 (the same size as the output mattes), named scale 0 to scale 3. In addition, we apply different weights to the different loss scales to make the network focus more on large-scale features. The super-resolution scale of ETOM-Net is set to ×2 in this work, and can be extended to ×3 or ×4 by training the model at different scales.
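The channel attention inside the RCAB (Eq. 10) can be sketched with numpy, treating the two 1×1 convolutions on the pooled channel vector as small dense matrices. This is a simplified illustration with hypothetical weights and a reduction ratio of our choosing; note that the Tanh gate follows this paper's description, whereas the original RCAN uses a sigmoid:

```python
import numpy as np

def rcab_attention(x, w1, b1, w2, b2):
    """Channel attention: O = Tanh(Conv(ReLU(Conv(Pool(I))))) * I.
    x is (C, H, W); w1 (C/r, C) and w2 (C, C/r) play the role of the
    two 1x1 convolutions acting on the pooled channel descriptor."""
    pooled = x.mean(axis=(1, 2))              # global average pooling -> (C,)
    hidden = np.maximum(w1 @ pooled + b1, 0)  # channel reduction + ReLU
    gate = np.tanh(w2 @ hidden + b2)          # per-channel gate in [-1, 1]
    return gate[:, None, None] * x            # channel-wise rescaling of the input

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4, 4))                        # 8 channels, 4x4 features
w1, b1 = 0.1 * rng.normal(size=(2, 8)), np.zeros(2)   # reduction ratio r = 4
w2, b2 = 0.1 * rng.normal(size=(8, 2)), np.zeros(8)
y = rcab_attention(x, w1, b1, w2, b2)
```

Because the gate is bounded by the Tanh, each channel of the input is rescaled rather than amplified without limit, which is what "adaptively rescale channel-wise features" refers to.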
Along with the main phase, we add a refinement phase using residual learning to add more detail to the output mattes of the main phase. As shown in Fig. 4, the refinement phase takes the LR input and the three output environment mattes from the main phase as input; the input tensor is then passed through several downsampling blocks, five RBs, and several upsampling blocks to form the output mattes. Each RB here consists of two convolutional layers and a ReLU activation layer; the input of the RB is subjected to an average pooling operation and added to the output of the RB to give the final output. The output of the refinement phase can be represented as:

O_refine = Conv(Up(RB_4(· · · (RB_0(Down([I_mask, I_flow, I_rho, I_lr])))))),  (11)

and the output of RB_i can be formulated as:

O_RB,i = Conv(ReLU(Conv(I_RB,i))) + Pooling(I_RB,i),  (12)

where O_refine denotes the output of the refinement phase, Up denotes the upsampling process, Down denotes the downsampling process, and [·] denotes channel-wise concatenation. I_mask, I_flow, I_rho, and I_lr denote the inputs of the refinement phase: a pixelwise mask, a refractive flow, an attenuation map, and the LR input, respectively.

B. LOSS FUNCTION 1) Main Phase
The loss function L_main of the main phase is divided into four parts, similar to [29]: a pixelwise mask loss L_mask, a refractive flow field loss L_flow, an attenuation loss L_rho, and a reconstruction loss L_rec. The loss function of the main phase can then be denoted as

L_main = λ_1 L_mask + λ_2 L_flow + λ_3 L_rho + λ_4 L_rec,  (13)

where λ_1, λ_2, λ_3, λ_4 are the weights of the components of the loss.

Segmentation mask loss
We treat pixelwise mask segmentation as a typical classification problem. The output mask has two channels, representing the probabilities of the foreground and the background, respectively. Simply put, a pixel is part of the transparent object if the value of its first channel is larger, and vice versa. We compute the mask loss L_mask using the cross-entropy loss

L_mask = Mean_{i,j}(−log P_ij(C_ij)),  (14)

where P_ij = (P_fore, P_back) denotes the probabilities that the pixel at (i, j) belongs to the foreground and the background, respectively, and C_ij ∈ {0, 1} denotes the ground truth of the pixel at (i, j) (C_ij = 0 means the pixel at (i, j) is a foreground pixel). Mean denotes the average over all pixels.
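A minimal numpy sketch of this per-pixel cross-entropy (the function name `mask_loss` and the softmax over raw two-channel scores are our assumptions about how the probabilities are obtained):

```python
import numpy as np

def mask_loss(logits, gt):
    """Pixelwise cross-entropy for a two-channel mask.
    logits: (H, W, 2) raw (foreground, background) scores.
    gt: (H, W) int labels with 0 = foreground, following the paper's convention."""
    # Numerically stable softmax over the two channels.
    z = logits - logits.max(axis=-1, keepdims=True)
    p = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    h, w = gt.shape
    picked = p[np.arange(h)[:, None], np.arange(w)[None, :], gt]  # P of correct class
    return -np.log(picked).mean()                                 # Mean(-log P)

logits = np.zeros((2, 2, 2))            # uniform prediction everywhere
gt = np.zeros((2, 2), dtype=int)        # all foreground
loss = mask_loss(logits, gt)            # -log(0.5) = log 2 per pixel
```

A completely uncertain prediction pays log 2 per pixel, while a confident correct prediction drives the loss toward zero.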

Refractive flow field loss
The output refractive flow of the main phase has two channels, representing the horizontal and vertical displacements, respectively. The output values are in the range [−1, 1] because of the Tanh activation function; we multiply them by different ratios at different scales so that the output flow covers the same range as the image width. For this term, we use the average endpoint error (AEE) loss, defined as the mean Euclidean distance between the estimated flow and the ground-truth flow:

L_flow = Mean_{i,j} sqrt((F^x_ij − F̃^x_ij)^2 + (F^y_ij − F̃^y_ij)^2),  (15)

where (F^x_ij, F^y_ij) denotes the output refractive flow at (i, j), and (F̃^x_ij, F̃^y_ij) denotes the ground-truth refractive flow at (i, j).
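The AEE loss is a one-liner in numpy (the function name is ours; the values in the example are chosen so each per-pixel error vector is (3, 4), i.e. an endpoint error of 5):

```python
import numpy as np

def flow_loss(pred, gt):
    """Average endpoint error: mean Euclidean distance between the
    predicted and ground-truth flow fields, both of shape (H, W, 2)."""
    return np.sqrt(((pred - gt) ** 2).sum(axis=-1)).mean()

gt = np.zeros((4, 4, 2))
pred = np.full((4, 4, 2), 3.0)
pred[..., 1] = 4.0                 # per-pixel error vector (3, 4)
err = flow_loss(pred, gt)          # 5.0
```

Unlike a per-channel MSE, the AEE penalizes the geometric length of the displacement error, which matches how refraction offsets are actually perceived.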

Attenuation map loss
The values of the output attenuation map are in the range [0, 1], indicating how much light can pass through the object. We use the mean square error (MSE) loss

L_rho = Mean_{i,j}((A_ij − Ã_ij)^2),  (16)

where A_ij is the output attenuation map at (i, j), and Ã_ij is the ground-truth attenuation map at (i, j).

Reconstruction loss
To evaluate the quality of the composited images, we reconstruct the image using the output environment mattes and the corresponding HR ground-truth background, and compare it to the HR ground-truth input image. As with the attenuation map loss, we use the MSE loss

L_rec = Mean_{i,j}((V_ij − Ṽ_ij)^2),  (17)

where V_ij is the pixel value at (i, j) in the reconstructed image, and Ṽ_ij is the pixel value at (i, j) in the ground-truth image.

2) Refinement phase
Similar to the main phase, the loss function L_refine of the refinement phase has two parts: a pixelwise mask segmentation loss L_mask and a refractive flow field loss L_flow. The loss function of the refinement phase can then be denoted as

L_refine = λ_1 L_mask + λ_2 L_flow,  (18)

where λ_1 and λ_2 are the weights of the components of the loss, and L_mask and L_flow are the same as in the main phase.

C. COMPARISON
Here, we compare the similarities and differences between TOM-Net [29] and our proposed ETOM-Net. Fig. 5 gives a brief comparison that does not include the implementation differences within each block. In the main phase, similar to TOM-Net, we use an encoder-decoder structure with three independent decoding processes to generate the three environment mattes. In addition, both methods use skip connections to connect feature maps of the same size, and multi-scale losses with four different scales.
Unlike TOM-Net, our approach takes LR images as input and predicts HR environment mattes by adding three RUs between the encoding and decoding processes. Each RU consists of four RIRBs, which are stacks of residual blocks using a channel-wise attention mechanism and residual learning, as shown in Fig. 3.
In the refinement phase, both methods use residual learning to refine the mattes predicted by the main phase. However, our refinement phase takes the upsampled LR image and the output mattes of the main phase as input, and predicts only a refined segmentation mask and refractive flow field, passing the input attenuation map through unchanged.
As mentioned before, the quality of the mask has a significant impact on the reconstructed image. Thus, compared to the refinement loss in TOM-Net, we add a mask loss to further improve the quality of the output mask, as its edges are not smooth enough after being super-resolved in the main phase. Moreover, we remove the attenuation map loss because the output attenuation map of the main phase is already good enough that the refinement phase cannot improve it further, and training with it slows down the convergence of the mask and refractive flow. As with TOM-Net, we do not include a reconstruction loss in the refinement phase because it does not help to preserve the sharp edges of the refractive flow field.

V. EXPERIMENTS

A. DATASET
Chen et al. [29] created a large-scale synthetic dataset because there was no off-the-shelf dataset for transparent object matting. This dataset consists of background images, input images, and ground-truth segmentation masks, attenuation maps, and refractive flows, with a total of 178,000 training samples. They also created a validation dataset with 900 samples for testing.
In our work, we also use this dataset. Since we are mainly concerned with super-resolution environment matting, and TOM-Net has already demonstrated the good generality of multi-scale encoder-decoder structures from basic to complex shapes, we use only part of their dataset (glass and glass with water) to reduce training and testing time. The resulting dataset has 60,000 training samples and 400 validation samples, which saved considerable time and allowed us to run an ablation study over several variants of the proposed method.

B. IMPLEMENTATION DETAILS
For the training settings, we used a batch size of eight in the main phase and four in the refinement phase, with learning rates starting at 0.0005 and 0.0002 for the main and refinement phases, respectively. We also decayed the learning rates by half every five epochs. For the optimizer, we used the Adam algorithm (β_1 = 0.9, β_2 = 0.999, ε = 1e−8) and applied an L2 penalty to prevent overfitting.
For the main-phase loss function in Eq. 13, we set the segmentation mask weight λ_1 = 1, the refractive flow weight λ_2 = 0.1, the attenuation weight λ_3 = 10, and the reconstructed image weight λ_4 = 10. For the refinement-phase loss in Eq. 18, we set the segmentation mask weight λ_1 = 10 and the refractive flow weight λ_2 = 0.1. Since we used four loss scales when training the main phase, we weighted them by 1/8, 1/4, 1/2, and 1 from scale 0 to scale 3, respectively, to make the network focus on the larger scales.
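The loss bookkeeping above amounts to two weighted sums, sketched below with the paper's weights (the function names and code organization are ours; the per-term loss values in the example are arbitrary placeholders):

```python
def total_loss(scale_losses, scale_weights=(1/8, 1/4, 1/2, 1)):
    """Combine the four loss scales of the main phase (scale 0 to scale 3)
    with the weights used in our experiments."""
    return sum(w * l for w, l in zip(scale_weights, scale_losses))

def main_phase_loss(l_mask, l_flow, l_rho, l_rec):
    """Weighted sum of the four main-phase terms with the paper's weights:
    lambda1 = 1, lambda2 = 0.1, lambda3 = 10, lambda4 = 10."""
    return 1.0 * l_mask + 0.1 * l_flow + 10.0 * l_rho + 10.0 * l_rec

print(total_loss([0.8, 0.4, 0.2, 0.1]))   # 0.1 + 0.1 + 0.1 + 0.1 = 0.4
```

The doubling weights across scales mean that, for equal raw losses, scale 3 contributes eight times more gradient than scale 0, biasing the network toward the full-resolution output.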
During training, the input images in the dataset are downsampled by a factor of two and used as input to the ETOM-Net main phase; its output is then concatenated with the main-phase input and fed to the refinement phase. The refinement phase outputs a refined segmentation mask and refractive flow field but leaves the attenuation map unchanged. Once training is complete, we can use the trained main-phase and refinement-phase models to predict an HR environment matte from a single LR input image in a single pass.
Using the Adam optimizer, the processing time on an

C. RESULTS
In the experiments, we compare the refinement phase of ETOM-Net against the main phase, TOM-Net (using bicubic-upsampled images as input), and a super-resolution method combined with TOM-Net, as shown in Fig. 6.
Here, we use EDRN [22] for super-resolution because it employs an encoder-decoder residual network and a channel-wise attention mechanism, which is similar to the structure of our ETOM-Net. A single input image is fed to EDRN to produce a super-resolved output image, which is then used as the input to TOM-Net to predict the high-resolution environment matte. We also introduce a baseline model that produces worst-case results: it uses the corresponding background image as the output reconstructed image, two one-filled tensors as the output attenuation map and segmentation mask to simulate null attenuation and a null mask, and a zero-filled tensor as the output refractive flow field to simulate a null flow without offsets.
Quantitative results of ETOM-Net are shown in Table 2.
We can see that all methods perform much better than the baseline model, which indicates that they can estimate the environment matte successfully. Compared to the original TOM-Net, SR + TOM-Net produces poorer results on all metrics because the super-resolution step introduces many unrealistic synthetic details, which affect the performance of TOM-Net. TOM-Net itself can already predict visually good results by taking bicubic-upsampled images as input. However, the main phase of our ETOM-Net outperforms TOM-Net in terms of the reconstructed image and the attenuation map. In particular, the main phase produces much better attenuation results than TOM-Net, which demonstrates the effectiveness of using the RUs within an encoder-decoder network. The ETOM-Net refinement phase further improves the output refractive flow and segmentation mask of the main phase, giving better results on both metrics than the main phase. It also produces the best overall results among all tested methods, showing the effectiveness of the refinement phase. Fig. 7 presents some qualitative results of our ETOM-Net compared to those of TOM-Net and SR + TOM-Net. Our method produces smoother borders in the output segmentation mask, and the stems of the wine glasses look more natural than those of the other methods. The ETOM-Net output attenuation has more detail, especially in the feet and stems of the glasses, and the reconstructed images show more detail and more realistic refraction.

VI. ABLATION STUDY
To understand the effectiveness of each component of our ETOM-Net, we create several variants of ETOM-Net and analyze them quantitatively. Here, we focus on the RU, the object mask loss (L_mask), and the attenuation loss (L_rho) in the refinement phase. We remove the RUs entirely, or half of each RU, to create the models main − RU and main − RU (part), respectively; we remove L_mask from the refinement phase to create the model refine − (L_mask); and we include L_rho to create the model refine + (L_rho). We also add a baseline model using the same tactics as in the previous section. As in the experimental results, we use IoU to evaluate the object masks, EPE for the refractive flow, and MSE for the attenuation map and the reconstructed image. The quantitative results are presented in Table 3.
First of all, all variants, including the main and refinement phases of ETOM-Net, exceed the baseline by a large margin on all evaluation metrics. Removing the RUs from the main phase degrades overall performance, and removing half of each RU gives better results than removing all of them, which indicates that the number of RIRBs inside each RU does have an impact on performance.
For the refinement phase, removing L_mask from, or adding L_rho to, the loss function leads to a decrease in performance; although the presence of L_rho further improves the attenuation metric, the other three metrics deteriorate as a result. In general, the main phase followed by the refinement phase produces the best results. Fig. 8 shows the effectiveness of the refinement phase.
We also evaluate how the number of RIRBs within each RU affects the results and training. In Fig. 9, panel (a) shows the relationship between the mean square error of the output reconstructed image and the number of RIRBs, and panel (b) shows the effect of the number of RIRBs on the GPU memory footprint (in megabytes). From Fig. 9(b), we can see that the GPU cost is positively correlated with the number of RIRBs. We used a batch size of eight in our tests; for eight and sixteen RIRBs, we had to halve the batch size to four to make the model trainable, and without gradient clipping the training process was volatile because of exploding gradients. From Fig. 9(a), the results improve from zero RIRBs up to eight, with an especially large jump from two to four. With sixteen RIRBs, performance drops, probably because the network becomes overly complex. Weighing the performance gain against the GPU footprint during training, we choose four RIRBs within each RU for our model.

VII. CONCLUSION
In this paper, we combine SISR and environment matting to propose an efficient CNN-based method for super-resolution of environment matting of transparent objects called ETOM-Net.
The proposed network uses an encoder-decoder architecture with skip connections and multi-scale losses that takes a single low-resolution image as input and then estimates the corresponding high-resolution environment mattes, including a refractive flow field, a pixelwise segmentation mask, and an attenuation map.
Three restoration units are added between the encoding and the decoding processes to allow the network to focus on more informative parts of the low-resolution input and restore more details to the high-resolution output environment mattes. Furthermore, a refinement network using residual learning is introduced to improve the details of the output segmentation mask and the output refractive flow of the main phase further.
ETOM-Net produces visually plausible results and outperforms the baseline model by a large margin. Compared to the TOM-Net and SR + TOM-Net models, the main phase of our method already outperforms them in terms of the reconstructed image and the attenuation map. With the refinement phase, ETOM-Net produces the best overall results among all models, demonstrating the effectiveness of our proposed method.
Although the proposed ETOM-Net is very effective, it is limited to fixed-scale super-resolution and can only be applied to transparent objects with a single mapping at each point; we will explore these limitations in future work.