A Flow-based Generative Network for Photo-Realistic Virtual Try-On

Image-based virtual try-on systems aim at transferring try-on clothes onto a target person. Despite considerable recent progress, such systems remain highly challenging for real-world applications because of occlusion and drastic spatial deformation. To address these issues, we propose a novel Flow-based Virtual Try-on Network (FVTN). It consists of three modules. Firstly, the Parsing Alignment Module (PAM) aligns the source clothing to the target person at the semantic level by predicting a semantic parsing map. Secondly, the Flow Estimation Module (FEM) learns a robust clothing deformation model by estimating multi-scale dense flow fields in an unsupervised fashion. Thirdly, the Fusion and Rendering Module (FRM) synthesizes the final try-on image by effectively integrating the warped clothing features and human body features. Extensive experiments on a public fashion dataset demonstrate that our FVTN qualitatively and quantitatively outperforms state-of-the-art approaches. The source code and trained models are available at https://github.com/gxl-groups/GFGN.


I. INTRODUCTION
As online shopping has continued to grow in popularity, virtually trying on clothes in an online fitting room has attracted much attention in recent years. A photo-realistic virtual try-on system will not only enhance the user shopping experience by fitting different clothes without changing them physically but also improve sales for retailers. This motivates many companies to develop various virtual fitting technologies, such as SenseMi, triMirror, etc.
Classical virtual try-on methods primarily rely on computer graphics to synthesize the try-on looks for users based on their 3D body shapes, desired poses and target clothing items [2], [28], [44], which can well control clothing deformation and material performance. However, the huge labor costs for 3D data annotation and upfront costs for scanning equipment inhibit their large-scale deployment [14].
Motivated by the rapid development of image synthesis methods [10]-[12], [19], [21], image-based virtual try-on methods using generative models provide a more economical solution, the goal of which is to naturally warp the try-on clothes onto a target person without leveraging any 3D information. Although image-based virtual try-on has made considerable progress recently, generating perceptually convincing virtual try-on images remains highly challenging in real-world scenarios. The main challenges lie in: (1) Occlusion occurs on the target person. For example, the target person's arms may cross over the chest and occlude the clothing region. (2) Varying deformation exists among different human poses and body shapes of the target person (e.g., limbs going from non-overlapping to overlapping), which makes it extremely hard to deform the garments to fit the posture and body shape of the target person. (3) Due to the drastic spatial deformation from the source clothing to the target person, generating a try-on image that maintains the detailed visual features of the original garment, such as texture and color, is a non-trivial task.
The aforementioned challenges can ultimately be summed up as two key problems in tackling image-based virtual try-on. The first is how to design a robust geometric deformation scheme that warps the source clothing to fit the target person. Existing approaches relying on affine or TPS transformations for warping the source clothing [6], [14], [36], [43] fail to generate precise appearance details because such methods cannot handle the transformations of non-rigid objects such as clothes. Recently, flow-based methods [13], [30], [46] have shown advantages over affine transformation approaches in learning complex non-rigid geometric deformation. Inspired by this, we propose a novel flow-based spatial alignment scheme for precisely capturing clothing deformation. The second is how to render the final try-on image by effectively fusing the contents of the body parts and the warped clothes. The quality of the try-on look highly depends on the appearance of the garments (e.g., texture, logo and color) as well as the characteristics of the target person (e.g., hair, face and arms). Previous approaches [14], [36] using a composition mask to integrate clothing and human body introduce obvious boundary artifacts in the intersection regions. These approaches overlook the occluded regions and fail to synthesize the body parts flexibly.
In this paper, we propose a novel Flow-based Virtual Try-on Network (FVTN), which consists of three modules. The first is the Parsing Alignment Module (PAM), which aligns the source clothing to the posture of the target person at the semantic level. This module provides accurate spatial information for the subsequent modules. The second is the Flow Estimation Module (FEM), which learns clothing deformation by estimating multi-scale dense flow fields in an unsupervised fashion. The predicted multi-scale flows are used to establish visual correspondence between the source clothing and the target try-on clothing in the feature domain. The learned flow fields do not warp the source clothing at the pixel level but at the feature level, because warping clothing at the pixel level makes it difficult for the model to capture large motions and generate new content [30]. The final part is the Fusion and Rendering Module (FRM), which aligns the source clothing to the target person at the pixel level. By effectively integrating the warped source clothing features and the body features, the proposed FRM can generate accurate clothing appearances and fine details of the human body. Experiments on the VITON dataset [14] demonstrate that the proposed FVTN can produce photo-realistic and perceptually convincing try-on images.
The main contributions of our work can be summarized as follows:
• We propose a new flow-based generative network with three tailored modules for image-based virtual try-on.
• We design a novel spatial alignment scheme in the flow estimation module to precisely capture clothing deformation by estimating multi-scale dense flow fields in an unsupervised fashion.
• We present a novel image synthesis network to synthesize the final try-on images by integrating information from the warped clothing features and the body features.
• Experimental results on VITON [14] verify that our method qualitatively and quantitatively outperforms the state-of-the-art methods.

II. RELATED WORK
A. VIRTUAL TRY-ON
Conventional virtual try-on approaches are based on graphics models. For instance, Sekine et al. [32] introduced a virtual fitting system that adjusts 2D clothing images to users by estimating their 3D body shape models from single-shot depth images. Yang et al. [41] computed a 3D model of a human body and outfits from a single-view image.
Pons-Moll et al. [28] used a multi-cloth 3D model of the body and clothing for capturing a clothed person in motion and retargeting the clothing to new body shapes. Patel et al. [27] proposed TailorNet for estimating clothing deformation in 3D as a function of three factors: body shape, body pose and garment style. Mir et al. [26] proposed Pix2Surf to digitally map the texture of clothing images to the 3D surface of virtual garment items, which enables 3D virtual try-on in real time. 3D methods can generate good virtual try-on results, but they usually require additional 3D measurements. Compared to graphics models, image-based generative models are more computationally efficient and broadly applicable. For example, VITON [14] proposed the first image-based virtual try-on method, which generates warped clothes using a Thin Plate Spline (TPS) transformation and maps the texture to the refined result with a composition mask. CP-VTON [36] improves VITON by using neural networks to directly learn the TPS parameters for clothing warping, and thus achieves more accurate alignment results. CP-VTON+ [25] outperforms CP-VTON by improving the clothing warping stage and the blending stage. VTNFP [43] achieves better try-on results than CP-VTON and VITON by concatenating the high-level features extracted from the body parts and the bottom garment, since CP-VTON and VITON only focus on the upper garment. ACGPN [40] synthesizes try-on images that preserve both the characteristics of the clothes and the details of human identity by using three modules. Han et al. proposed ClothFlow [13] for handling pose-guided synthesis and image-based virtual try-on. Like ClothFlow, our proposed FVTN learns clothing deformation with a flow-based method. However, different from ClothFlow, we leverage an unsupervised flow training scheme relying on the photometric loss [42].
Furthermore, our FVTN uses the learned flow to warp the garments at the feature level instead of the pixel level, which helps capture large motions and generate new content.

B. OPTICAL FLOW
Optical flow [16], [17] is the task of estimating dense pixel-to-pixel correspondence between two input images, and is widely used in many applications such as action recognition, motion tracking, video segmentation and 3D reconstruction. Optical flow has traditionally been approached as a hand-crafted optimization problem, the objective of which is defined as a trade-off between a data term and a regularization term [1]. Recently, deep learning has emerged as a promising alternative to traditional methods. FlowNet [7] is the first trainable CNN for optical flow estimation. FlowNet2 [18] improves the flow accuracy of FlowNet by cascading several variants of it. Subsequently, Ranjan and Black introduced SpyNet [29], a compact spatial image pyramid network that warps images at multiple scales to cope with large displacements. Recent notable contributions to end-to-end trainable optical flow include PWC-Net [34] and LiteFlowNet [18]. They proposed to use feature warping and a cost volume at multiple pyramid levels in a coarse-to-fine estimation, yielding more compact and effective networks. We draw inspiration from these coarse-to-fine flow estimation methods.
To avoid annotating labels, Meister et al. [24] proposed an end-to-end unsupervised learning approach by designing a bidirectional flow-based loss function. Wang et al. [38] further proposed an unsupervised learning framework that models occlusion and large motions. Liu et al. [23] proposed SelFlow, which distills reliable flow estimations from non-occluded pixels using self-supervised training. Unsupervised optical flow estimation is closer to our setting. However, different from these works, we focus on learning a flow that establishes correspondence between the source clothing and the target try-on clothing.

III. PROPOSED METHOD
As shown in Fig. 1, our FVTN is composed of three modules. The first is the Parsing Alignment Module (PAM), which transfers the source clothing onto the target person at the semantic level. The proposed PAM provides accurate spatial and semantic information for the subsequent modules. The second is the Flow Estimation Module (FEM), which learns the diverse spatial deformation between the source clothing and the target try-on clothing by estimating multi-scale dense flow fields in an unsupervised way. The final part is the Fusion and Rendering Module (FRM), which fuses the warped features of the source clothing and the features of the human body to render the final try-on image. Fig. 2 illustrates the details of these three modules.
Ideally, we need an image triplet (I_s, I_t, I_r) to train the FVTN, where I_s is the source clothing image, I_t is the target person image, and I_r stands for the ground-truth image. However, such a dataset is hard to obtain. Therefore, I_r is replaced with I_t to train the FVTN in our implementation.

A. PARSING ALIGNMENT MODULE (PAM)
To disentangle the generation of shape and appearance, PAM aligns the source clothing I_s to the target person I_t at the semantic level. It takes the semantic mask of the source clothing M^c_s, the segmented source clothing I^c_s, the pose of the target person P_t and the binary mask of the target person's head M^h_t as input to predict the target semantic parsing map M_t. The predicted parsing map is required to retain the body parts and the pose of the target person while accurately showing the shapes and categories of the transformed source clothing.
We use a human parser [9] to compute parsing maps with 20 semantic labels for I_s and I_t. Each parsing map is represented as a one-hot tensor with 20 channels. In addition, we use a state-of-the-art pose estimator [3] to estimate the pose of the target person. Following [36], P_t is represented as an 18-channel heat map in which each channel encodes one joint of the human body.
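To make the 18-channel pose representation concrete, here is a minimal NumPy sketch that encodes each joint as a Gaussian heat map centered on its pixel coordinate. The paper does not specify the exact encoding (some implementations following [36] instead draw a small white rectangle around each joint), so the Gaussian form and the `sigma` value are illustrative assumptions:

```python
import numpy as np

def pose_heatmaps(joints, h, w, sigma=6.0):
    """Encode body joints as per-joint heat maps, one channel per joint.

    joints: list of (x, y) pixel coordinates, or None for an undetected joint.
    Returns an array of shape (len(joints), h, w). The Gaussian shape and
    sigma are illustrative choices, not taken from the paper.
    """
    ys, xs = np.mgrid[0:h, 0:w]
    maps = np.zeros((len(joints), h, w), dtype=np.float32)
    for k, joint in enumerate(joints):
        if joint is None:  # undetected joint: leave its channel all-zero
            continue
        x, y = joint
        maps[k] = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
    return maps
```

With 18 joints, stacking these channels yields the (18, H, W) pose tensor P_t that is concatenated with the other PAM inputs.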
In this module, we simply adopt a conditional generative adversarial network [37], in which a U-Net structure is used as the generator while a discriminator is utilized to distinguish the generated parsing map from the ground-truth parsing map. The overall objective function for PAM is formulated as:

L_PAM = λ_adv L_adv + λ_seg L_seg,

where L_adv is the adversarial loss [37] and L_seg is the pixel-wise cross-entropy loss. λ_adv and λ_seg are the trade-off parameters for these two loss terms, which are set to 0.2 and 1, respectively, in our experiments. The pixel-wise cross-entropy loss L_seg constrains pixel-level accuracy during semantic parsing map generation, and is defined as:

L_seg = -(1 / (H · W · C)) Σ_{h,w,c} M̂_t(h, w, c) log M_t(h, w, c),

where H, W and C are the height, width and number of channels of the parsing map, respectively, M_t is the generated parsing map and M̂_t is the ground truth.
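The pixel-wise cross-entropy term can be sketched in a few lines of NumPy. This is a generic cross-entropy over one-hot parsing maps, not the authors' implementation; the normalization over H·W·C follows the description in the text:

```python
import numpy as np

def parsing_ce_loss(pred_logits, target_onehot, eps=1e-12):
    """Pixel-wise cross-entropy between a predicted parsing map and a
    one-hot ground truth, averaged over H * W * C.

    pred_logits, target_onehot: arrays of shape (C, H, W), where C is the
    number of semantic labels (20 in the paper).
    """
    # softmax over the channel (label) axis, shifted for numerical stability
    z = pred_logits - pred_logits.max(axis=0, keepdims=True)
    p = np.exp(z) / np.exp(z).sum(axis=0, keepdims=True)
    c, h, w = pred_logits.shape
    return -(target_onehot * np.log(p + eps)).sum() / (h * w * c)
```

A correct prediction (high logit on the true label at every pixel) drives this loss toward zero, while confidently wrong predictions are penalized heavily.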

B. FLOW ESTIMATION MODULE (FEM)
As discussed above, building a robust clothing deformation model is crucial for image-based virtual try-on. Early image-based virtual try-on methods [14], [25], [36], [43] warp clothes by computing a Thin Plate Spline (TPS) transformation. However, because of its low degree of freedom, the TPS transformation can only model limited geometric transformations and is too inflexible to achieve complex and non-rigid deformation [46]. Considering that flow-based methods can capture complex non-rigid geometric deformation [13], [30], [46], we design an unsupervised flow-based clothing deformation scheme that requires no explicit correspondence annotation.
With the predicted parsing map M_t, we first obtain the semantic mask of the target clothing M^c_t. FEM takes M^c_s and M^c_t as input to predict multi-scale dense flow fields for establishing visual correspondence between the source clothing and the target try-on clothing in the feature domain. To deal with the drastic spatial deformation between the source clothing and the target try-on clothing, we estimate the flow fields in an iterative manner, where the flow is first estimated at low resolution and then upsampled and refined at higher resolutions. Specifically, we deploy a two-stream weight-sharing Feature Pyramid Network (FPN) to extract two feature pyramids {f_s(1), ..., f_s(N)} and {f_t(1), ..., f_t(N)} from M^c_s and M^c_t, respectively, where N corresponds to the lowest spatial resolution (in our case N = 5) and 1 corresponds to the highest. The extracted multi-scale features are used to estimate the flow from the source clothing to the target one in an unsupervised way. Beginning with the lowest spatial resolution, after concatenating f_s(N) and f_t(N), a flow estimation layer initially infers a coarse flow F_N:

F_N = Def([f_s(N), f_t(N)]),

where Def denotes the deformable convolution [4] layer. We replace the standard convolution with the deformable convolution in the flow estimation layer to improve the network's ability to handle drastic spatial deformation, since the standard convolution lacks the ability to spatially transform its inputs [4]. At the next higher spatial resolution, the flow estimation layer obtains a refined flow F_{N-1} by computing a residue flow R_{N-1} and adding the upscaled flow field 2F↑2_N, as illustrated in Fig. 2:

R_{N-1} = Def([W(f_s(N-1), 2F↑2_N), f_t(N-1)]),
F_{N-1} = R_{N-1} + 2F↑2_N,

where W is a warping operation with bilinear interpolation for flow fields that fall on sub-pixel coordinates. This allows end-to-end training via stochastic gradient descent [47]. Note that the resolution of F_N is upsampled with bilinear interpolation and its value is doubled.
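The upsample-and-double step and the residue addition can be sketched directly. A minimal NumPy version, using nearest-neighbour upsampling for brevity where the paper uses bilinear interpolation:

```python
import numpy as np

def upscale_flow(flow):
    """Double the spatial resolution of a flow field (2, H, W) and double its
    values, i.e. compute 2 * F_upsampled. Nearest-neighbour upsampling is
    used here for brevity; the paper uses bilinear interpolation."""
    return 2.0 * flow.repeat(2, axis=1).repeat(2, axis=2)

def refine_flow(upscaled_flow, residue):
    """One coarse-to-fine refinement step: add the predicted residue flow to
    the upscaled-and-doubled coarser flow."""
    return upscaled_flow + residue
```

Doubling the flow values is necessary because a displacement of one pixel at the coarse scale corresponds to two pixels after the resolution doubles.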
This process is repeated until the finest flow F_1 is inferred from the two pyramid features with the highest spatial resolution, f_s(1) and f_t(1). Finally, F_1 is upsampled and its value doubled to obtain F_0 = 2F↑2_1. Inspired by [35], the flow estimation layers share weights between iterations, which speeds up model training and reduces the number of model parameters.
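The warping operation W used throughout this module samples features at flow-displaced, possibly sub-pixel coordinates with bilinear interpolation. A self-contained NumPy sketch (a pixel-level illustration of the same operation the paper applies to feature maps):

```python
import numpy as np

def bilinear_warp(img, flow):
    """Warp img (C, H, W) with a flow field (2, H, W): output pixel (y, x)
    samples img at (y + flow[1], x + flow[0]), using bilinear interpolation
    so that sub-pixel coordinates are handled smoothly."""
    c, h, w = img.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # absolute sampling coordinates, clipped to the image bounds
    x = np.clip(xs + flow[0], 0, w - 1)
    y = np.clip(ys + flow[1], 0, h - 1)
    x0 = np.floor(x).astype(int); y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, w - 1); y1 = np.minimum(y0 + 1, h - 1)
    wx = x - x0; wy = y - y0
    # blend the four neighbouring pixels
    return (img[:, y0, x0] * (1 - wx) * (1 - wy)
            + img[:, y0, x1] * wx * (1 - wy)
            + img[:, y1, x0] * (1 - wx) * wy
            + img[:, y1, x1] * wx * wy)
```

Because the four neighbour weights vary smoothly with the flow values, gradients can pass through this sampler, which is what enables the end-to-end training mentioned above.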
Since we do not have ground-truth flow, we leverage an unsupervised flow training scheme relying on the photometric loss [42] over the clothing images. The overall objective function for FEM is formulated as:

L_FEM = λ_phot L_phot + λ_TV L_TV + λ_perc1 L_perc1,

where L_phot is a multi-scale photometric loss, L_TV is a flow regularization loss and L_perc1 is the perceptual loss [20]. λ_phot, λ_TV and λ_perc1 are the trade-off parameters for these three loss terms, which are set to 5, 2 and 1, respectively, in our experiments.
The multi-scale photometric loss L_phot sums the photometric loss between the source clothing regions and the target ones at multiple scales for fast convergence [46]. It is defined as:

L_phot = Σ_i ρ(I^c_t(i) - W(I^c_s(i), F_i)),

where ρ(x) = (x² + ε²)^α is a penalty function for mitigating the effects of outliers [42], I^c_t and I^c_s are the segmented images containing the clothing regions of the target and the source, respectively, and i indexes the spatial resolution of images and flows. Note that I^c_t(i), I^c_s(i) and F_i share the same spatial resolution.
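The robust penalty ρ and one scale of the photometric term are easy to sketch in NumPy. The `eps` and `alpha` values below are illustrative defaults, not the paper's settings (the paper follows [42]):

```python
import numpy as np

def charbonnier(x, eps=1e-2, alpha=0.45):
    """Generalized Charbonnier penalty rho(x) = (x^2 + eps^2)^alpha, which
    grows sub-quadratically and is therefore less sensitive to outliers
    than a plain squared error. eps and alpha here are illustrative."""
    return (x ** 2 + eps ** 2) ** alpha

def photometric_loss(warped_src, target):
    """Single-scale photometric term: penalize per-pixel differences between
    the warped source clothing region and the target clothing region."""
    return charbonnier(target - warped_src).mean()
```

Summing this term over all pyramid scales, with the source clothing warped by the flow at each scale, yields the multi-scale loss described above.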
The flow regularization loss L_TV is a total variation-based (TV) smoothness penalty that regularizes the flow prediction:

L_TV = ||∇F_N||_1 + Σ_i ||∇R_i||_1.

Unlike previous methods [13], [46] that regularize the multi-scale flows, we apply the smoothness loss to the coarse flow F_N and the multi-scale residue flows. To preserve realistic details and textures of the source clothing, we add a perceptual loss between I^c_t and the warped source clothing segment image W(I^c_s) = W(I^c_s, F_0). Specifically, the perceptual loss L_perc1 models the distance between I^c_t and W(I^c_s) in a feature space:

L_perc1 = Σ_{i=1}^{N_l} β_i ||φ_i(I^c_t) - φ_i(W(I^c_s))||_1,

where N_l is the number of chosen layers, φ_i(·) denotes the feature map at the i-th chosen layer of a VGG-19 [33] network pre-trained on ImageNet [5], and β_i is a hyper-parameter that controls the contribution of each layer, set by following [14].
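The TV smoothness penalty reduces to absolute differences between neighbouring flow vectors. A minimal NumPy sketch of a TV term over a single flow field (the exact weighting and which fields are regularized follow the paper's description, not this sketch):

```python
import numpy as np

def tv_loss(flow):
    """Total-variation smoothness penalty on a flow field (2, H, W): the mean
    absolute difference between vertically and horizontally neighbouring
    flow vectors. A perfectly constant flow incurs zero penalty."""
    dy = np.abs(flow[:, 1:, :] - flow[:, :-1, :]).mean()
    dx = np.abs(flow[:, :, 1:] - flow[:, :, -1:0:-1][:, :, ::-1]).mean() if False else \
         np.abs(flow[:, :, 1:] - flow[:, :, :-1]).mean()
    return dx + dy
```

Applying this to F_N and to each residue flow R_i, and summing, gives the L_TV term above; it discourages spatially erratic warps while still permitting smooth non-rigid deformation.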

C. FUSION AND RENDERING MODULE (FRM)
Beyond the clothing deformation model, rendering the final try-on image by fusing the contents of the human body and the warped clothes is another major challenge. FRM accepts the segmented source clothing I^c_s, the body parts of the target person I^b_t, the head region of the target person I^h_t, the parsing map of the target person M_t and the pose of the target person P_t as input to synthesize the photo-realistic try-on image I_t.
Specifically, FRM adopts three encoders with the same architecture, i.e., ENC_c, ENC_b and ENC_h, to encode features for the source clothing, the body of the target person and the head of the target person, respectively. Note that the three encoders do not share weights during training. ENC_c extracts source clothing features from I^c_s through N downsampling layers:

f_c(n) = ENC_c(I^c_s), n = 1, ..., N,

where f_c(n) denotes the extracted clothing features after n downsampling layers. ENC_h extracts the head features of the target person separately from the body parts to enhance the signals of facial and hair features:

f_h(n) = ENC_h(I^h_t), n = 1, ..., N,

where f_h(n) denotes the extracted head features after n downsampling layers. Likewise, ENC_b extracts the body features f_b(n), n = 1, ..., N, of the target person conditioned on I^b_t, M_t and P_t. With the clothing and head regions removed, the arms and legs are represented as body parts. Next, the final try-on result I_t is generated through N decoding blocks, where each decoding block accepts the concatenated warped clothing features, body features and head features; the clothing feature f_c(n) is warped via the flow F_n predicted by the previous module. Note that f_c(n) and F_n are ensured to have the same spatial resolution. A Tanh function is applied after the final decoding block. During training, the body-part input is corrupted with a random binary mask M^b_t sampled from the Irregular Mask Dataset [22]. Without M^b_t, FRM tends to learn an identity mapping for the body parts (i.e., arms and legs). For example, when transferring a long-sleeve garment to a target person wearing a short-sleeve one, the arm parts should be rendered with clothing textures instead of retaining the original arms; conversely, when transferring a short-sleeve garment to a target person wearing a long-sleeve one, the arm parts should be synthesized instead of retaining the original clothing textures.
By introducing M b t , FRM can adaptively determine the generation or preservation of the body parts.
The overall objective function for FRM is formulated as:

L_FRM = λ_L1 L_L1 + λ_perc2 L_perc2 + λ_sty L_style,

where L_L1 is the reconstruction loss, L_perc2 is the perceptual loss [20] and L_style is the style loss [8]. λ_L1, λ_perc2 and λ_sty are the trade-off parameters for these three loss terms, which are set to 1, 1 and 400, respectively, in our experiments. The reconstruction loss is the L1 loss between the synthesized image I_t and the ground-truth image I_r:

L_L1 = ||I_t - I_r||_1.

The perceptual loss L_perc2 models the distance between the synthesized image I_t and the ground-truth image I_r in a feature space, defined analogously to L_perc1. L_style is the style loss that matches style information between the synthesized image and the ground-truth image:

L_style = Σ_i γ_i ||ψ_i(I_t) - ψ_i(I_r)||_1,

where ψ_i(I_t) denotes the Gram matrix [8] of image I_t at the i-th chosen layer of a VGG-19 [33] network pre-trained on ImageNet [5], and γ_i is a hyper-parameter that controls the contribution of each layer, set by following [13].
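The Gram-matrix computation underlying the style loss can be sketched compactly. The C·H·W normalization below is a common convention and an assumption here, not a detail taken from the paper:

```python
import numpy as np

def gram_matrix(feat):
    """Gram matrix of a feature map (C, H, W): channel-wise inner products,
    normalized by C * H * W (a common convention, assumed here). It captures
    texture/style statistics while discarding spatial layout [8]."""
    c, h, w = feat.shape
    f = feat.reshape(c, h * w)
    return f @ f.T / (c * h * w)

def style_loss(feats_gen, feats_gt, gammas):
    """Weighted L1 distance between Gram matrices at the chosen layers; in
    the paper the features come from a pre-trained VGG-19."""
    return sum(g * np.abs(gram_matrix(a) - gram_matrix(b)).sum()
               for g, (a, b) in zip(gammas, zip(feats_gen, feats_gt)))
```

Because the Gram matrix pools over all spatial positions, matching it encourages the synthesized image to reproduce the garment's texture statistics even where exact pixel alignment is imperfect.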

IV. EXPERIMENTS
A. IMPLEMENTATION DETAILS
Dataset. We conduct the experiments on the VITON dataset [14] to evaluate the proposed FVTN on the virtual try-on task.
We follow [36] for the training and testing splits. The FPN in FEM consists of five encoding layers, where each layer is a convolution layer with kernel size 3 and stride 2 followed by one residual block. Besides, we use a deformable convolution with kernel size 3 and stride 1 as the flow estimation layer. FRM is trained with a mini-batch size of 8 and a learning rate of 0.0001 for 20 epochs. In FRM, the three encoders share the same architecture, consisting of five downsampling layers, where each layer contains two convolution layers with kernel size 3 and strides of 2 and 1, respectively. Evaluation Metrics. We adopt four widely used evaluation metrics to assess the quality of the synthesized images. Inception Score (IS) [31] measures the quality and diversity of the generated images. Structural Similarity (SSIM) [39] measures the similarity between the generated images and the ground-truth images. Fréchet Inception Distance (FID) [15] measures the realism of the generated images by computing the Wasserstein-2 distance between the distributions of the generated images and the ground-truth images. Learned Perceptual Image Patch Similarity (LPIPS) [45] measures how similar two images are by computing the distance between the generated images and the ground-truth images in the perceptual domain.

B. EVALUATIONS
We mainly perform visual comparisons of our method with recently proposed virtual try-on networks [13], [25], [36], [40]. Quantitative Results. Table 1 reports the quantitative comparison between our approach and the baselines. Our method significantly outperforms all baselines on FID and LPIPS, while remaining competitive on SSIM and IS. The IS metric provides a proxy for evaluating performance, but it is not a good measurement of how well a model performs our task. Although our method obtains the second-highest SSIM score, FID and LPIPS more accurately reflect the similarity between the synthesized images and the ground-truth images. Qualitative Results. Figure 3 presents a visual comparison of the evaluated methods, where the first column is the garment image, the second column is the target person image and the remaining columns are the virtual try-on images synthesized by the different approaches. We observe that the flow-based ClothFlow warps garments more naturally than the TPS-based baselines when learning clothing deformation. However, ClothFlow produces unsatisfactory human body parts because of its simple try-on image rendering model. ACGPN outperforms these three methods thanks to its second-order spatial transformation constraint and inpainting module. However, we notice visible high-frequency artifacts in the collar regions of the try-on images generated by ACGPN. By contrast, our proposed FVTN generates more perceptually convincing results, warping the garments more naturally and aligning them with the human body more accurately. Figure 4 presents additional qualitative results. Ablation Study. The quantitative results of the ablation study are shown in Table 1. From the results, we find that our full model outperforms all the variants on all metrics, and that all components improve performance to different degrees. The qualitative results are visualized in Fig. 5. Consistent with the quantitative evaluations, our full model surpasses all the variants with the highest-quality visual results.
Besides, we make the following observations: (1) w.o/mask cannot adaptively determine the generation or preservation of the body parts; (2) w.o/iter generates very poor garment appearances. User Study. We conduct a user study with 20 participants. Following [40], given two generated images, each participant is asked to choose the better and more realistic image according to three criteria: (1) how well the clothing characteristics of the source clothing image are preserved; (2) how photo-realistic the whole image is; (3) how good the whole person looks. Our FVTN achieves significantly better human evaluation scores on image-based virtual try-on, as shown in Table 2. The results of the user study are consistent with those of the qualitative and quantitative experiments, demonstrating the effectiveness of the proposed FVTN.

FIGURE 6: Failure cases of our method on the virtual try-on task.
Failure Cases. Fig. 6 displays two failure cases of our proposed FVTN on the virtual try-on task. The example in the top row is caused by a rarely-seen human pose, while the other is due to the viewpoint change from the front view to the back view.

V. CONCLUSION
In this work, we propose a novel Flow-based Virtual Try-on Network (FVTN), which aims at generating photo-realistic try-on results. We present three tailored modules, i.e., the Parsing Alignment Module (PAM), the Flow Estimation Module (FEM) and the Fusion and Rendering Module (FRM). Specifically, we design an unsupervised flow-based spatial alignment scheme in FEM to precisely capture clothing deformation, and propose an image synthesis network in FRM to synthesize the try-on look by integrating information from the warped clothing and the human body. The results clearly show the superiority of our proposed FVTN in terms of quantitative metrics, visual quality and the user study.