Deep Learning in Virtual Try-On: A Comprehensive Survey

Virtual try-on technology has gained significant importance in the retail industry due to its potential to transform the way customers interact with products and make purchase decisions. It allows users to virtually try on clothing and accessories, providing a realistic representation of how the items would look and fit without the need for physical interaction. The ability to virtually try on products addresses common challenges associated with online shopping, such as uncertainty about fit and style, ultimately enhancing the overall customer experience and satisfaction. As a result, virtual try-on technology has the potential to reduce returns and optimise conversion rates for businesses, making it a valuable tool in the e-commerce landscape. In this paper, we provide a comprehensive review of deep learning based virtual try-on models, focusing on their functionality, technical details, dataset usage, weaknesses, and impact on customer satisfaction. The models are categorised into three main types: image-based, multi-pose, and video virtual try-on models, with detailed examples and technical summaries provided for each category. Additionally, we identify and discuss similarities and differences in these methods. Furthermore, we examine the datasets currently available for building and evaluating virtual try-on models, including the number of images/videos and their resolutions. We present the commonly used methods for both qualitative and quantitative evaluations, comparing synthesised images with previous work and performing quantitative evaluations across various metrics and benchmark datasets. We discuss the weaknesses of current deep learning based virtual try-on models, including challenges in preserving clothing characteristics and textures, the level of accuracy of applying the clothing to the person, and the preservation of facial identities. 
Additionally, we address dataset bias, particularly the domination of female models, limited diversity in clothing featured, and relatively simple and clean backgrounds in the datasets, which can negatively impact the model’s ability to handle challenging situations. Moreover, we explore the impact of virtual try-ons on customer satisfaction, highlighting the benefits that customers can enjoy, which also reduces returns and optimises conversion rates for businesses.


I. INTRODUCTION
The online fashion industry is experiencing a remarkable boom, with more individuals embracing online shopping than ever before [4]. Particularly during the COVID-19 pandemic, purchasing clothes online has become a widespread trend. This phenomenon reflects how garment shopping has evolved for numerous consumers. As the digital sphere continues to expand with the advent of e-commerce and online shopping, there has been a surge of interest in exploring methods to provide consumers with the same experience they would receive when shopping in physical stores [13].
The associate editor coordinating the review of this manuscript and approving it for publication was Prakasam Periasamy.
The inability to try on garments poses a severe obstacle to online purchases, as consumers do not know whether the product will suit, fit and match them [140]. In order to succeed and boost sales, retailers must adapt and cater to the increasing number of online consumers. Virtual try-on provides a means for consumers to engage virtually with the product, facilitating a connection between businesses and online consumers. Embracing virtual try-on technology allows businesses to thrive in the online realm and effectively cater to the needs of their customers [86], [101].
Virtual try-on allows for the seamless exploration of various items such as shoes [34], [196], upper clothing [67], [195], lower clothing [58], [189] and more [98], [127]. Some models even offer the convenience of modifying multiple items in a single image simultaneously [199]. This technology empowers consumers by providing an immersive experience and complete freedom in decision-making, enabling them to experiment with various products, resulting in a confident and personalised purchase. Using this technology can enhance the online shopping experience by recreating an atmosphere in which consumers feel they are shopping in a physical store.
Several prominent brands have embraced virtual try-on technology as part of their online platforms. For instance, RayBan [1] enables consumers to try on glasses virtually using their webcam's live feed. Similarly, L'Oréal [3] allows consumers to experiment with different makeup options and change their hair colour by utilising a live feed from their webcam or by uploading an image. Hugo Boss [2] utilises a 3D avatar that allows customers to visualise how a desired garment would look when worn.
There are a few existing survey papers that discuss the potential use of artificial intelligence (AI) in supporting the fashion industry and facilitating fashion-related tasks [27], [59], [61], [132], as shown in TABLE 1. These surveys cover a wide range of topics, such as how AI can benefit the fashion industry, how fashion tools are developed and employed, which countries are leading in research on AI for fashion, how fashion data is utilised to enhance the effectiveness of AI models, and the classification of fashion-based AI tools.
However, there is a critical gap left by these existing reviews. They lack a focused examination of the technical details of deep learning based virtual try-on models and have not examined the effects and the impact that virtual try-on models have on a wider scale. Moreover, they only cover a few outdated papers, such as VITON [67] and CP-VTON [181], with brief explanations of how they work. None of the survey papers are dedicated solely to virtual try-ons. Instead, they incorporate a wider range of AI tools for the garment industry. We believe that virtual try-on models are among the most powerful AI fashion applications as they provide immediate personalised feedback to customers, and there is significant development in this field that is worth investigating. Therefore, we present a survey paper dedicated to these models. We show their technical development and synthesise findings from various studies to offer a comprehensive understanding of how virtual try-on models can reshape online shopping experiences and drive business growth in the digital age.
We categorise virtual try-on methods based on common characteristics, providing a nuanced understanding of their functionality and efficacy. This in-depth categorisation sets our survey apart from existing works, which may not have delved as deeply into the specifics of virtual try-on technologies. Furthermore, our paper explores the impact of these models on consumers, shedding light on user experiences and perceptions, another dimension often left unexplored in prior surveys.
The key contributions of this study are as follows:
1) We fill a literature gap by conducting a comprehensive review of deep learning based virtual try-on models, focusing on three main types: image-based, multi-pose, and video virtual try-on models, with detailed examples and technical summaries provided for each category.
2) We examine the datasets currently available for building and evaluating virtual try-on models, considering the number of images/videos and their resolutions. Commonly used methods for both qualitative and quantitative evaluations are presented, including comparisons among models and quantitative assessments across various metrics and benchmark datasets.
3) We discuss the weaknesses of current deep learning based virtual try-on models, addressing challenges related to preserving clothing characteristics and textures, the accuracy of applying clothing to individuals, preserving facial identities, and dealing with dataset bias.
4) We explore the impact of virtual try-ons on customer satisfaction, emphasising the benefits that customers can enjoy.
FIGURE 1. The scope of this study. Our survey paper explores image-based, multi-pose, and video virtual try-on models with a focus on technical details. We delve into how these models affect values such as utility, enjoyment and social value, as well as their perceived risks. We analyse how these models impact a business's sales and conversion rate. The paper also covers the datasets used, how experiments are conducted, and how these models are evaluated.

II. METHODOLOGY
We provide a detailed account of the methodology employed for conducting the Systematic Literature Review (SLR).
The SLR aimed to comprehensively identify and analyse existing research studies relevant to deep learning-based virtual try-on models. The literature collection process was conducted in a systematic and rigorous manner to ensure the comprehensiveness of the search and the selection of relevant studies. A detailed search strategy was devised in consultation with subject matter experts to identify pertinent literature across multiple academic databases. The following steps outline the process:
1) Identification of Databases: We systematically searched key academic databases, including but not limited to IEEE Xplore Digital Library, ACM Digital Library, Scopus, Web of Science, and Google Scholar, to retrieve relevant publications before 2024.
2) [...] refined through pilot searches to ensure its effectiveness in retrieving relevant studies.
3) Inclusion and Exclusion Criteria: Clear inclusion and exclusion criteria were established a priori to guide the selection of studies. Peer-reviewed articles and well-cited arXiv articles published in English were considered eligible for inclusion. Studies lacking empirical data or those not directly related to the scope of the research were excluded.
4) Screening Process: We conducted an initial screening of retrieved studies based on title and abstract to assess their relevance to this topic. We then conducted a thorough full-text assessment to determine their suitability for inclusion based on our predefined criteria.
5) Reference List Inspection: The reference lists of included studies were manually inspected to identify additional relevant publications missed during the database search.
A. SCOPE OF THE STUDY
FIGURE 1 illustrates the scope of this study. We focus on virtual try-on models that utilise deep learning methods. The deep learning based virtual try-on models are divided into image-based, multi-pose, and video virtual try-ons. We will investigate research papers examining how virtual try-on models influence utilitarian value, hedonic value, social value, and perceived risk. These factors collectively shape customer satisfaction [51], [56]. We will also investigate the impact that virtual try-on has on sales and conversion rates. Finally, we will explore the datasets used by virtual try-on models, their experiment procedures, and evaluation methods. We will investigate how deep learning based virtual try-on systems can be categorised and what characteristics they have in common. We conduct a comprehensive review of existing literature related to virtual try-ons across the aforementioned categories. We will identify any patterns in the use of techniques within each category, which can help determine the direction of future development. After reviewing these studies, we discuss how these models measure performance and how they compare with one another.
In the following subsections, we first provide a brief review of the two fundamental generative models that have been widely used in many virtual try-on applications: generative adversarial networks (GANs) and diffusion models.

B. GENERATIVE ADVERSARIAL NETWORKS
In order to control the generated output images in the realm of GANs, the adoption of Conditional GAN (cGAN) [131] emerges as a promising solution. Various methodologies exist for guiding the image generation process within cGANs. Notable examples encompass the utilisation of class labels [17], [138], textual descriptions [141], [148], [151], [193], attributes [164], and sketches [84], [112]. These techniques enable cGANs to produce images aligned with specific criteria or desired characteristics. Consequently, the applications of cGANs, particularly in the domains of virtual try-on and fashion-related contexts [121], [123], have gained significant relevance. We show an example in FIGURE 2 of how cGAN uses conditional data to produce the desired output.
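As a rough illustration of class-label conditioning, the sketch below concatenates a one-hot label with the noise vector before it enters a generator, and with the generated sample before it would enter a discriminator. It is a minimal NumPy sketch under stated assumptions, not the architecture of any surveyed model: the single linear layer, the dimensions, and the names `one_hot` and `toy_generator` are hypothetical and chosen only for illustration.

```python
import numpy as np


def one_hot(labels, num_classes):
    """Encode integer class labels as one-hot rows."""
    out = np.zeros((len(labels), num_classes), dtype=np.float32)
    out[np.arange(len(labels)), labels] = 1.0
    return out


def toy_generator(z, cond, weights):
    """Hypothetical one-layer generator: the condition is concatenated to the noise."""
    g_in = np.concatenate([z, cond], axis=1)
    return np.tanh(g_in @ weights)


rng = np.random.default_rng(0)
batch, z_dim, num_classes, img_dim = 4, 16, 3, 64  # illustrative sizes only

z = rng.standard_normal((batch, z_dim)).astype(np.float32)
cond = one_hot(np.array([0, 1, 2, 1]), num_classes)
weights = (0.1 * rng.standard_normal((z_dim + num_classes, img_dim))).astype(np.float32)

fake = toy_generator(z, cond, weights)

# The discriminator is conditioned in the same way: it sees the sample and the label.
d_in = np.concatenate([fake, cond], axis=1)

# Same noise, different label: the output changes with the condition.
fake_a = toy_generator(z, one_hot(np.zeros(batch, dtype=int), num_classes), weights)
fake_b = toy_generator(z, one_hot(np.ones(batch, dtype=int), num_classes), weights)
```

In image-conditioned settings the same idea is usually applied channel-wise to image tensors rather than to flat vectors.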
A majority of virtual try-on models have incorporated the GAN mechanism either as a whole or within specific modules [76], [142], [159], [195], [200]. By integrating GANs into the virtual try-on framework, these models have demonstrated the ability to generate try-on images with exceptional fidelity.
However, it is crucial to acknowledge the challenges associated with cGAN-based methods when confronted with substantial spatial deformations between the target clothing and the pose of the individual. Notably, CP-VTON [181] has demonstrated instances where cGAN-based approaches can exhibit unstable image generation under such conditions. Consequently, it becomes imperative to develop prerequisite methods that effectively guide cGANs during the image synthesis process, mitigating potential issues arising from large spatial deformations.

C. DIFFUSION MODELS
In recent studies, the performance of diffusion models has surpassed that of GANs in the domain of image synthesis [40]. There are three formulations of diffusion models [37]: the denoising diffusion probabilistic model (DDPM) [74], the noise conditioned score network (NCSN) [170] and the stochastic differential equation (SDE) [171]. In principle, they all involve two Markov chains, as shown in FIGURE 3. The forward process is usually designed manually to convert any data distribution into Gaussian noise in gradual steps, while the reverse process uses deep neural networks to undo the Gaussian noise and synthesise an image incrementally.
FIGURE 3. Diffusion models for image generation in virtual try-on. The approach involves two Markov chains: the forward process, which gradually adds noise to data, and the reverse process, which reverses the effects of the forward process using a neural network. x_t denotes the image at the current noise timestep t. x_0 is the original image that a neural network is trained to gradually produce, while x_T is the most perturbed image that the network starts with.
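The forward process can be sketched in a few lines for the DDPM formulation, where x_t can be sampled directly from x_0 as x_t = sqrt(ᾱ_t) x_0 + sqrt(1 − ᾱ_t) ε, with ᾱ_t the cumulative product of (1 − β_s). The sketch below is a minimal NumPy illustration under assumptions: the linear β schedule from 1e-4 to 0.02 over T = 1000 steps is a commonly used setting rather than one taken from any surveyed try-on model, and the names `make_alpha_bar` and `q_sample` are hypothetical. The learned reverse process is not shown.

```python
import numpy as np


def make_alpha_bar(T=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for an assumed linear noise schedule."""
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)


def q_sample(x0, t, alpha_bar, rng):
    """Draw x_t from q(x_t | x_0) in closed form; t is a 1-indexed timestep in 1..T."""
    eps = rng.standard_normal(x0.shape)
    a = alpha_bar[t - 1]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps, eps


rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()

# Stand-in for an image scaled to [-1, 1]; the 3 x 256 x 192 shape is illustrative.
x0 = rng.uniform(-1.0, 1.0, size=(3, 256, 192))

x_early, _ = q_sample(x0, 1, alpha_bar, rng)     # barely perturbed
x_T, _ = q_sample(x0, 1000, alpha_bar, rng)      # close to pure Gaussian noise
```

A reverse-process network would be trained to predict the noise `eps` (or an equivalent target) from `x_t` and `t`, and then applied step by step starting from x_T.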
Like GANs, output images of diffusion models can be controlled through textual descriptions [154] or input images [157]. This flexibility opens up various possibilities for their application in the field of fashion.
Integrating diffusion models into fashion synthesis is still a relatively new research direction, and only a limited number of studies have been conducted thus far [11], [19], [94], [133], [211]. However, given the promising results achieved and the potential of diffusion models in virtual try-on applications, it is plausible to expect that their utilisation in this area will become more prevalent in the near future.

III. TYPES OF VIRTUAL TRY-ON
In this section, we organise virtual try-on models into distinct categories based on their common functionalities and characteristics. We discuss their respective architectures and provide a detailed explanation of how they operate. It is worth noting that certain models may exhibit overlapping functionalities and may be included in multiple categories to reflect their diverse traits.

A. PHYSICS-BASED VIRTUAL TRY-ON
Virtual try-on models were first developed as physics-based simulations. These simulations involve using 3D data to create a virtual garment that fits onto a 3D avatar [64], [144], [146], [162]. Mathematical models are mostly used to manipulate clothing data and create realistic wrinkles [64], [146], [162], but newer models have started using neural networks [144]. The downside of these models is that physics-based algorithms are computationally expensive and difficult to control [144]. Additionally, they require 3D scans of either humans [162] or clothing [146], which is a time-consuming and impractical process.
In this paper, our emphasis will be on the deep learning-based models that have gained more popularity within the research community and shown greater promise in applications. Interested readers can refer to [27], [59], and [61] for further details regarding the 3D virtual try-on models.

B. IMAGE-BASED VIRTUAL TRY-ON
An image-based virtual try-on model takes images of both a user and a target clothing item as input and overlays a virtual representation of the clothing onto the user's body. The criteria for an image-based virtual try-on model are:
• Preserving the posture and body shape of the person.
• Preserving clothing items that are not intended to be replaced.
• Ensuring the target clothing item fits well to the intended body part of the person.
• Retaining the texture and details of the garment.
We provide an example of an image-based virtual try-on in FIGURE 4, showcasing how these techniques can overlay virtual clothing onto individuals. In TABLE 2, we have listed all image-based virtual try-ons, highlighting the warping method they use and their contributions. Image-based virtual try-ons can be split into two categories: models that use a clean image of the clothing item and models that transfer a garment from one person to another.
CAGAN [88] is the first image-based virtual try-on model. This model employs a single network that combines various images to create a try-on image. Nonetheless, an important drawback of this method is its dependence on both the target clothing and the original garment images during the inference phase. Reliance on the original clothing image poses a significant constraint because users may not be able to provide it. VITON [67] has made this more practical by employing an approach that utilises only the image of the target clothing. Many researchers focus only on replacing a person's upper garment. This is because they use a commonly available dataset that only contains pairs of candidates with their upper clothes. By doing so, they can easily compare their work with previous research. However, some models are capable of changing any type of clothing, including trousers [110], [115], [116], [133], [134], [135].
Input data is crucial for virtual try-on models, which take into account not only the person and clothing images but also pose maps [20], [166], though the type of pose map can vary. References [66], [181], [195], and [152] use a multi-channel pose heatmap where each channel represents a key joint of the human body. Meanwhile, there are models that use an RGB skeleton image to show the spatial relation among the joints [9], [30], [82]. References [6], [135], [159], and [187] utilise DensePose [65] to extract texture from an image and apply it to a UV parametrisation of a human model. The parser-free models use pose maps during training but not during inference [58], [85]. In FIGURE 6, we present the pose heatmaps.
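To illustrate the multi-channel pose heatmap described above, the following NumPy sketch builds one channel per joint from a list of 2D keypoints. It is a sketch under assumptions rather than the encoding of any specific model: the Gaussian blob shape, the `sigma` value, the 256 × 192 canvas, and the name `pose_heatmap` are illustrative choices, and the keypoint coordinates are made up.

```python
import numpy as np


def pose_heatmap(keypoints, height=256, width=192, sigma=4.0):
    """Return a (num_joints, height, width) array with one Gaussian blob per joint.

    keypoints is a list of (x, y) pixel coordinates; None marks an undetected
    joint, whose channel is left all-zero.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    maps = np.zeros((len(keypoints), height, width), dtype=np.float32)
    for j, kp in enumerate(keypoints):
        if kp is None:
            continue
        x, y = kp
        maps[j] = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2.0 * sigma ** 2))
    return maps


# Hypothetical keypoints (x, y); the third joint was not detected.
keypoints = [(96, 40), (96, 80), None, (60, 85)]
heatmap = pose_heatmap(keypoints)
```

The resulting tensor can be concatenated channel-wise with the person and clothing images to form the network input.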
The majority of virtual try-on models use GANs to create photorealistic images [76], [109], [195]. They use discriminators and a GAN loss to encourage their modules to produce accurate segmentation and photorealistic images [47], [82], [134], [200]. Other models use only a generator for synthesising the image [67], [191] and rely on the perceptual loss from a pre-trained VGG network [92] to ensure realism. More recently, researchers have started using diffusion models instead of the traditional GAN approach to synthesise virtual try-on images. Recent studies have shown that diffusion models can perform this task better than GAN-based methods [63], [133], [211].
Early virtual try-on methods employed a generic convolutional encoder-decoder architecture with skip connections [155] to synthesise a virtual try-on image. Over time, more sophisticated architectures have been developed to improve image fidelity, such as using a transformer [23], [152], creating normalisation layers or blocks [54], [110], [111], [175], [197] and using feature pyramid networks [66], [136], [188].
FIGURE 5. Transferring a garment using a virtual try-on. StylePoseGAN [159] demonstrates its capability to extract clothing from a given image and seamlessly apply it to a candidate.
There have been challenges in maintaining the identity and details of clothes in virtual try-on models. Virtual try-on models have developed warping modules to help preserve vital clothing details and align the garment with the person's body. There are two methods for warping the clothing image: thin-plate spline (TPS) [15] and appearance flow (AF) [209]. Models that use TPS determine their parameters by establishing correspondences between the target clothing image and the candidate's body key points. To prevent clothing distortion caused by TPS, some models impose constraints [109], [145], [192], [195]. On the other hand, some virtual try-on models have utilised AF to warp the garment image [33], [58], [66]. AF comprises a set of 2D coordinate vectors, where each vector indicates which pixels in the clothing image should be deformed to align with the candidate. The warping module is not commonly used in garment transfer models because they do not have a clean image of the clothing item [124], [149], [187].
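The appearance flow idea can be sketched as a dense lookup: every output pixel reads its colour from a location in the clothing image given by a 2D vector. The NumPy sketch below is a simplified illustration under assumptions, not the warping module of any surveyed model: it stores the flow as (dx, dy) offsets relative to each output pixel, uses nearest-neighbour sampling with border clamping (differentiable bilinear sampling would be needed for training), and the name `warp_with_flow` is hypothetical.

```python
import numpy as np


def warp_with_flow(cloth, flow):
    """Warp cloth (H, W, C) with a dense flow field (H, W, 2) of (dx, dy) offsets.

    output[y, x] = cloth[y + dy, x + dx], rounded to the nearest pixel and
    clamped to the image borders.
    """
    h, w = cloth.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.clip(np.rint(xs + flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys + flow[..., 1]).astype(int), 0, h - 1)
    return cloth[src_y, src_x]


# Tiny stand-in "clothing image" so the effect is easy to inspect.
cloth = np.arange(4 * 5 * 3).reshape(4, 5, 3)

identity = np.zeros((4, 5, 2))      # zero offsets leave the image unchanged
shift = np.zeros((4, 5, 2))
shift[..., 0] = 1                   # every pixel samples its right-hand neighbour

unchanged = warp_with_flow(cloth, identity)
warped = warp_with_flow(cloth, shift)   # content moves one pixel to the left
```

In a try-on model the flow field would be predicted by a network from the person and clothing inputs instead of being set by hand.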
Virtual try-on models use a segmentation module to predict how the semantic layout will look after a virtual try-on [135], [175], [195], [200]. This module helps the model determine which parts of the image should be generated or preserved to achieve high-quality results. However, some authors argue against using segmentation because it can generate inaccurate or incomplete segments, which can severely degrade the result. Instead, these models use knowledge distillation [72] to train a student network to generate try-on images without relying on the predicted semantic layout [23], [58], [85], [136].
Alternative methods are needed in scenarios where clean images of clothes are absent. Researchers have proposed models capable of extracting the garment from one person and seamlessly applying it to another [6], [159], [187], [211]. FIGURE 5 shows an illustrative example of how this garment transfer process can be accomplished.
Recently, there has been a notable surge of interest among researchers in enhancing the functionality of traditional image-based virtual try-on systems. A particular aspect that has captured attention is the incorporation of a posture modification feature. TABLE 3 shows a list of multi-pose virtual try-on models that aim to generate a realistic image of a person donning the desired garment in a predefined pose while maintaining the identity of the person and product. This task presents great challenges, such as preserving facial detail and clothing texture. FIGURE 7 demonstrates a multi-pose virtual try-on model that can change a candidate's garment and posture. In this section, we will provide an overview of these models and explain their operational mechanisms.
In summary, the evolution of image-based virtual try-on models has witnessed significant advancements, with each approach introducing novel techniques. While early models like CAGAN [88] pioneered the field, they faced limitations. The transition to models like VITON [67] and SwapNet [149] exemplifies a shift toward better practicality. The majority of the work is limited to upper garments. The incorporation of pose maps and segmentation modules provided networks with vital information for synthesising accurate virtual try-on images. Many models employ warping modules, choosing between TPS and AF to warp the garment. Transformers [179], knowledge distillation [72] and purpose-built normalisation layers have played a crucial role in refining image fidelity. The adoption of diffusion models over GANs in recent studies highlights the ongoing exploration of alternative synthesis methods for photorealistic outcomes.

C. MULTI-POSE VIRTUAL TRY-ON
While the image-based virtual try-on models normally attempt to overlay the targeted clothing item onto the user's body without changing the posture (in this sense, they are also referred to as single-pose virtual try-on), the multi-pose virtual try-on models offer more flexibility by generating images with different clothing and in different postures simultaneously. The criteria for a multi-pose virtual try-on model are:
• Transferring the facial identity of the person to the desired pose.
• Transferring clothing items that are not intended to be replaced to the desired pose.
• Ensuring the target clothing item fits well to the intended body part of the person in the desired pose.
• Retaining the texture and details of the garment.

VOLUME 12, 2024
MG-VTON [41] represents a pioneering step towards enabling virtual try-on models to synthesise images in new postures. The approach is similar to that employed by conventional virtual try-on models, with modules for generating body segments and warping the target garment. However, the novelty of MG-VTON lies in its ability to generate images of candidates in diverse poses, which was previously not possible with traditional virtual try-on systems.
Most multi-pose virtual try-on models have three major modules: segmentation, warping, and try-on synthesis [41], [49], [182]. The segmentation module is responsible for generating a semantic layout that aligns with the desired target pose. There are two ways in which a multi-pose virtual try-on model can warp a garment: either through the commonly used TPS [41], [206] or through appearance flow [49], [198]. With both methods, the garment is warped and transformed to align with the target pose. The try-on module then combines all images and synthesises the remaining parts of the try-on image.
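To make the TPS option more concrete, the sketch below fits a thin-plate spline to a handful of point correspondences and uses it to map new points. It is a minimal NumPy sketch under assumptions: the control points are hand-picked and hypothetical, coordinates are normalised to [-1, 1], the spline parameters are obtained by solving the standard TPS linear system in closed form, and the names `fit_tps` and `apply_tps` are illustrative. It shows only the geometric mapping, not how a try-on model obtains the correspondences or resamples the garment image.

```python
import numpy as np


def _tps_kernel(d2):
    """Radial basis U(r) = r^2 log(r^2), evaluated on squared distances, with U(0) = 0."""
    out = np.zeros_like(d2)
    nz = d2 > 0
    out[nz] = d2[nz] * np.log(d2[nz])
    return out


def fit_tps(src, dst):
    """Solve for TPS parameters that map control points src (N, 2) onto dst (N, 2)."""
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    n = src.shape[0]
    d2 = np.sum((src[:, None, :] - src[None, :, :]) ** 2, axis=-1)
    P = np.hstack([np.ones((n, 1)), src])
    A = np.zeros((n + 3, n + 3))
    A[:n, :n] = _tps_kernel(d2)
    A[:n, n:] = P
    A[n:, :n] = P.T
    b = np.zeros((n + 3, 2))
    b[:n] = dst
    return np.linalg.solve(A, b)  # first n rows: kernel weights; last 3 rows: affine part


def apply_tps(params, src, pts):
    """Map arbitrary points pts (M, 2) with a fitted spline."""
    src = np.asarray(src, dtype=float)
    pts = np.asarray(pts, dtype=float)
    d2 = np.sum((pts[:, None, :] - src[None, :, :]) ** 2, axis=-1)
    P = np.hstack([np.ones((pts.shape[0], 1)), pts])
    return _tps_kernel(d2) @ params[:-3] + P @ params[-3:]


# Hypothetical control points: four corners and the centre of a garment image.
src = np.array([[-1, -1], [1, -1], [-1, 1], [1, 1], [0, 0]], dtype=float)
dst = src.copy()
dst[4] = [0.2, 0.1]  # pull the centre towards a new location

params = fit_tps(src, dst)
mapped_controls = apply_tps(params, src, src)

# A pure translation is reproduced by the affine part alone.
shift_params = fit_tps(src, src + np.array([0.3, -0.2]))
shifted_point = apply_tps(shift_params, src, np.array([[0.5, 0.5]]))
```

The spline interpolates the control points exactly and bends smoothly in between, which is why unconstrained TPS can distort a garment when the correspondences are poor.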
Multi-pose virtual try-on models often use a multi-channel pose heatmap to guide the synthesis of an image that matches the desired posture. There are exceptions, however, such as AB-GAN [206], which uses an RGB skeleton image. The models listed in TABLE 3 utilised GANs; however, it is expected that diffusion models will be used in the future since single-pose virtual try-on models are already utilising them [63], [133], [211].
To summarise, MG-VTON [41] is a significant advancement in generating multi-pose virtual try-on images, which offers more benefits than traditional single-pose models. Like the single-pose variant, multi-pose also consists of three main modules: segmentation, warping, and try-on synthesis. The segmentation module creates a semantic layout corresponding to the desired pose, while the warping module adjusts the garment using techniques like TPS or appearance flow. The try-on synthesis module combines these elements to generate the final try-on image. The majority of multi-pose virtual try-on models use a multi-channel pose heatmap for image synthesis. Notably, GANs are widely employed in existing models. However, we predict that diffusion models will be used in the future since single-pose virtual try-on models are already utilising them [63], [133], [211].

D. VIDEO VIRTUAL TRY-ON
Unlike the image-based and multi-pose virtual try-on models, the video virtual try-on models aim to produce a continuous video of a user wearing new clothing from the input of a target clothing item and a reference video of the user, or in some models, a single image of the user instead of a reference video. The criteria for a video virtual try-on model are:
• Keeping the facial identity of the person consistent throughout the video.
• Preserving clothing items that are not intended to be replaced.
• Ensuring the clothing item is consistently fitting, accurately positioned, and smooth in the video.
• Maintaining the texture and details of the garment throughout the video.
A video virtual try-on model uses an image of the target garment to dress a person in a video with spatiotemporal consistency, as demonstrated in FIGURE 8. This approach offers a wider range of viewing angles for the clothing product and illustrates how it moves with the human body. There are challenges presented in this category, such as ensuring the clothing is warped and deformed accordingly throughout the video and eliminating inconsistency between frames. TABLE 4 provides an overview of the methods used by video virtual try-on. This table reveals how researchers explore new and innovative ways of improving video virtual try-on models.
GR-VTON [153] was the first to create virtual try-on videos. This model utilised image-based and 3D techniques to apply dresses to a person in a video. However, 3D data is difficult to gather and work with [42], [147]. The first deep learning based approach for producing video virtual try-on is called image2Video [147]. Its method uses images of segmentation and clothing together with video to generate virtual try-on videos.
Some of the video virtual try-on models employ TPS to warp the garment [42], [89], [104] whilst others perform garment transfer [44], [207]. Optical flow (OF) [45], [172] is a common technique utilised by video virtual try-ons [26], [42], [44]. OF allows the model to calculate offsets between adjacent frames in videos and helps to warp and align the garment and the person throughout the video to ensure consistency.
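The notion of an offset between adjacent frames can be illustrated with a deliberately simplified stand-in for optical flow. Real OF methods estimate a dense, per-pixel displacement field; the NumPy sketch below instead searches for a single global integer shift that best aligns one frame with the next, which is only a toy version of the idea. The frames are synthetic, the wrap-around behaviour of `np.roll` is a simplification, and the name `estimate_global_offset` is hypothetical.

```python
import numpy as np


def estimate_global_offset(prev, curr, max_shift=4):
    """Brute-force search for the integer (dy, dx) shift that best maps prev onto curr."""
    best, best_err = (0, 0), np.inf
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            candidate = np.roll(prev, shift=(dy, dx), axis=(0, 1))
            err = np.mean((candidate - curr) ** 2)
            if err < best_err:
                best, best_err = (dy, dx), err
    return best


rng = np.random.default_rng(0)
frame_prev = rng.random((32, 24))
# Synthetic "next frame": the previous one moved 2 pixels down and 3 pixels left.
frame_curr = np.roll(frame_prev, shift=(2, -3), axis=(0, 1))

offset = estimate_global_offset(frame_prev, frame_curr)

# Warping the previous frame by the estimated offset aligns it with the current one.
aligned = np.roll(frame_prev, shift=offset, axis=(0, 1))
```

In a video try-on pipeline the analogous dense flow would be used to warp the previous output or garment features onto the current frame so that consecutive frames stay consistent.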
The earlier models for video synthesis used variants of the standard U-Net architecture [155]. These models added extra components like a memory module and made use of optical flow to guide the U-Net [42], [147]. However, recent models have achieved better performance by using the transformer model [179] to create try-on videos [80], [89], [176].
The discriminator has an important role in some video generation models to improve the temporal consistency of generated try-on videos [42], [89], [147]. Moreover, it can also be used to enhance the overall quality of the video by improving the sharpness, clarity and level of detail, and by reducing visual artefacts or distortion that may appear in the generated video [42], [147], [207].
The evolution of video virtual try-on models has witnessed significant advancements, beginning with GR-VTON [153] as the pioneer, leveraging image-based and 3D techniques to apply dresses in videos. Despite its innovation, reliance on 3D data posed challenges. Subsequent models like image2Video [147] emerged, employing deep learning to generate virtual try-on videos by utilising images only. Techniques such as TPS are used for garment warping. Many models use OF to ensure alignment and consistency between garments and individuals throughout the video. Early models incorporated U-Net architectures [155] with additional components. Recent advancements have embraced transformer models [179], achieving superior performance. The discriminator's role is pivotal, enhancing temporal consistency and overall video quality by refining sharpness, clarity and detail and by reducing visual artefacts and distortions in generated try-on videos.
FIGURE 8. Examples of virtual try-on videos produced by ClothFormer [89]. The model aims to synthesise a try-on video using data from the garment.

IV. DATASETS AND PERFORMANCE EVALUATION
In this section, we will discuss the datasets commonly used to train and evaluate virtual try-on models. Datasets serve as the foundational building blocks for the development and evaluation of such technologies. These datasets play a pivotal role in training algorithms, enabling virtual try-on algorithms to simulate realistic clothing interactions and user experiences. The diversity and quality of the dataset directly influence the accuracy and generalisation capabilities of virtual try-on systems.
It is crucial to demonstrate the quantitative and qualitative performance of virtual try-on models because they can significantly influence the shopping experience of customers [101], [204]. The accuracy of a virtual try-on model in generating content affects customers' confidence that the clothing item will fit and suit them in real life.

A. DATASETS
As outlined in TABLE 5, the datasets presented are highly valuable for training various groups of virtual try-on models. These datasets have gained significant recognition and are widely utilised within the research community.
All datasets listed for image-based virtual try-ons have a comprehensive collection of candidate images along with their corresponding clothing counterparts. The VITON [67], FashionTryOn [206] and MPV [41] datasets have a resolution of 256 pixels in height and 192 pixels in width, while the VITON-HD dataset [30] was crafted to yield superior quality try-on images by employing a significantly higher resolution, with a height of 1024 pixels and a width of 768 pixels.
The FashionTryOn and MPV datasets introduce an additional component wherein each candidate is presented in a different pose, further enriching them for multi-pose virtual try-on research. Both datasets keep the resolution of 256 pixels in height and 192 pixels in width.
The authors of the MPV dataset have made the decision to withdraw it from public availability. The specific reasons behind the withdrawal have not been disclosed.
The MVC dataset [122] offers distinct characteristics compared to the FashionTryOn and MPV datasets. MVC provides four different views (front, back, left, and right views) for each candidate, enabling a comprehensive multi-view analysis of clothing. The dataset has 161260 images and a resolution of 2240 pixels in height and 1920 pixels in width.
The video virtual try-on (VVT) dataset [42] consists of 791 videos. These videos were recorded at a rate of 30 frames per second, with a resolution of 256 pixels in height and 192 pixels in width. Each video has a different length, ranging from 250 to 300 frames. For training and testing purposes, the authors of the ShineOn framework [104] divided the dataset into a training set with 159170 frames and a testing set with 30931 frames. Each video in the dataset is paired with a clean image of the upper clothing item, like shirts or blouses.
The Dwnet dataset [202] comprises 500 videos in the training set and 100 videos in the testing set. These videos consist of approximately 350 frames each and are recorded at 30 frames per second. In these videos, a female model is depicted wearing a dress and engaging in simple movements, such as shifting from side to side. This allows for a continuous and varied view of the dress from different angles. The average resolution of the videos is 720 pixels wide and 940 pixels tall, although this varies. It is worth noting that, unlike the VVT dataset, the Dwnet dataset does not include a clean image of the dress.
In TABLE 5, most of the datasets listed show an adult female standing upright against a white background. In these images and videos, the model is wearing various upper garments. This enables virtual try-on models to handle all sorts of cases regardless of the person's posture and the type of garment.

B. PAIRED V UNPAIRED SETTINGS
Typically, a virtual try-on model uses two types of data: an image or video of a candidate and an image of the clothing product. The models are trained using the paired setting. This means that the candidate is paired with a clothing item that they are already wearing. By doing this, the synthesised content can be compared to the original candidate image or video without needing to modify it. This type of learning is called supervised learning.
Evaluating and comparing work is another reason for using paired settings. TABLES 6, 7 and 8 show quantitative comparisons of the quality of synthesised images/videos in paired settings.
Unpaired settings are employed for real-life scenarios. This entails pairing the candidate's image with clothing items that the person desires. This enables consumers to explore and experiment with various garments virtually. Once a model is trained in paired settings and has learned how to map the original clothing onto the person, it should be able to work in unpaired settings.
There are only a few studies that use unpaired settings to evaluate the accuracy of models that generate try-on images [82], [83], [190]. These studies have demonstrated that some models produce inaccurate results, for example, transforming a long-sleeved target garment into a short-sleeved one [82]. They then use unpaired settings to show that their own models perform better. However, since there is no ground truth for unpaired settings, these models use pseudo ground-truth to show how a garment is meant to fit a person, even if the person's appearance and pose in the synthesised image are different.
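The two settings described in this subsection can be illustrated with a minimal sketch of how evaluation pairs might be assembled. The function name `make_pairs`, the `shift` parameter, and the identifiers in the usage example are hypothetical and not taken from any particular dataset or codebase; the sketch assumes that `garment_ids[i]` is the garment worn by `person_ids[i]` and that all garments are distinct.

```python
def make_pairs(person_ids, garment_ids, paired=True, shift=1):
    """Build (person, garment) pairs for try-on training or evaluation.

    Assumption: garment_ids[i] is the garment that person_ids[i] is
    already wearing, and all garments are distinct.
    """
    if len(person_ids) != len(garment_ids):
        raise ValueError("person_ids and garment_ids must have equal length")
    n = len(person_ids)
    if paired:
        # Paired setting: the target garment is the one already worn,
        # so the original image can serve as ground truth.
        return list(zip(person_ids, garment_ids))
    if n < 2 or shift % n == 0:
        raise ValueError("unpaired setting needs >= 2 samples and a non-zero shift")
    # Unpaired setting: rotate the garment list so that nobody is
    # matched with the garment they are wearing (no ground truth exists).
    return [(person_ids[i], garment_ids[(i + shift) % n]) for i in range(n)]


people = ["p0", "p1", "p2"]
garments = ["g0", "g1", "g2"]
paired_pairs = make_pairs(people, garments)
unpaired_pairs = make_pairs(people, garments, paired=False)
```

A fixed rotation is only one way to guarantee that no candidate receives their own garment; a random shuffle with rejection would serve equally well.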

C. QUALITATIVE EVALUATION
Facilitating a direct comparison of try-on images among virtual try-on methods is a crucial aspect of evaluating their performance. Researchers assess and contrast the outcomes of different techniques. FIGURE 9, extracted from a popular virtual try-on research paper [54], serves as an illustrative example of this approach.
Many works present specific selections of candidate images with varying poses being paired with garments with varying sleeve lengths. By doing so, researchers aim to comprehensively demonstrate how their model performs across a variety of real-world situations.
The purpose of the qualitative comparison is to highlight the strengths and limitations of the presented virtual try-on model in relation to its competitors. By placing the visual outputs of different methods side by side, researchers can effectively showcase the unique features, accuracy, and realism achieved by their approach. This visual analysis provides valuable insights into the model's ability to accurately synthesise the try-on image and adapt to various garment types.

D. QUANTITATIVE EVALUATION
There are various methods available to evaluate the quality of synthesised try-on images on a test dataset. Some popular quantitative methods include the Structural Similarity Index (SSIM), Inception Score (IS), Fréchet Inception Distance (FID), and Learned Perceptual Image Patch Similarity (LPIPS). The majority of researchers have used paired settings for evaluation, which means that the person in the image is paired with a clothing image that they are already wearing.
The SSIM metric [184] gauges the resemblance between the synthesised image and the corresponding ground truth by evaluating the luminance, contrast, and structural similarities. The magnitude of the SSIM index directly reflects the level of concordance between the two images, with larger values indicating superior correspondence. The formula for SSIM is:

$$ \mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)} \tag{1} $$

The formula compares the structural similarity between two input images, denoted as $x$ and $y$. It incorporates various variables to measure the average intensity ($\mu_x$, $\mu_y$), the spread or variability of pixel intensities ($\sigma_x^2$, $\sigma_y^2$), and the linear relationship (covariance $\sigma_{xy}$) of the pixel values. Additionally, small constants ($C_1$, $C_2$) are included to ensure stability and prevent division by zero. By evaluating the luminance similarity, contrast similarity, and structure similarity, the SSIM metric produces a similarity value that is at most 1 and in practice usually lies between 0 and 1, where 1 represents a perfect match between the images.
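As a rough illustration of how the quantities described above combine, the following sketch computes SSIM from the means, variances, and covariance of two greyscale images over a single global window. This is a simplification: standard implementations evaluate the index over local sliding windows and average the results. The function name `global_ssim` is hypothetical, and the constants $C_1 = (0.01 L)^2$ and $C_2 = (0.03 L)^2$, with $L$ the dynamic range, are the commonly used defaults rather than values stated in this survey.

```python
import numpy as np


def global_ssim(x, y, data_range=255.0, k1=0.01, k2=0.03):
    """Single-window SSIM between two greyscale images of equal shape.

    Simplified sketch: real implementations average SSIM over local windows.
    """
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    if x.shape != y.shape:
        raise ValueError("images must have the same shape")
    # Stabilising constants (assumed common defaults).
    c1 = (k1 * data_range) ** 2
    c2 = (k2 * data_range) ** 2
    mu_x, mu_y = x.mean(), y.mean()          # average intensities
    var_x, var_y = x.var(), y.var()          # spread of intensities
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()  # linear relationship
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x**2 + mu_y**2 + c1) * (var_x + var_y + c2)
    return float(numerator / denominator)


rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(32, 24)).astype(np.float64)
noisy = np.clip(image + rng.normal(0, 25, image.shape), 0, 255)
score_same = global_ssim(image, image)    # identical images
score_noisy = global_ssim(image, noisy)   # degraded copy scores lower
```

Comparing an image with itself yields the maximum score, while any degradation lowers it; an intensity-inverted copy makes the covariance term negative, which is why the index is not strictly confined to the 0 to 1 range.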
The Inception Score (IS) [158] is a metric for evaluating the quality of generative models. It measures the quality and diversity of the generated images by feeding them through a pre-trained classifier and computing the score based on the output probabilities. Specifically, the IS is calculated as the exponential of the expected value of the KL divergence between the class distribution predicted for each generated image and the marginal class distribution over all generated images. A higher IS indicates that the generated images are more diverse and visually appealing.

$$ \mathrm{IS} = \exp\left( \mathbb{E}_{x \sim P_G} \left[ D_{KL}\left( P(y|x) \,\|\, P(y) \right) \right] \right) \tag{2} $$

IS measures the quality and diversity of generated images. In the formula, $P_G$ represents the distribution of generated images. The notation $\mathbb{E}_{x \sim P_G}$ indicates that $x$ is sampled from this distribution. The conditional distribution $P(y|x)$ represents the distribution of labels $y$ (values from the pre-trained classifier) given a generated image $x$. The marginal distribution $P(y)$ represents the overall distribution of labels. The Kullback-Leibler divergence, denoted as $D_{KL}$, quantifies the dissimilarity between these two distributions. The formula calculates the expected value, denoted by $\mathbb{E}$, of the Kullback-Leibler divergence over all generated images. Finally, the exponential function $\exp$ is applied to the expected Kullback-Leibler divergence to obtain the Inception Score. This score provides a measure of the quality and diversity of the generated images based on the discrepancy between the conditional and marginal label distributions.
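The computation described above can be sketched in a few lines once the classifier outputs are available. The sketch below assumes the per-image class probabilities $P(y|x)$ have already been obtained from a pre-trained classifier and are passed in as an array; the function name `inception_score` and the small constant `eps` (added to avoid taking the logarithm of zero) are illustrative choices, not part of any specific library.

```python
import numpy as np


def inception_score(probs, eps=1e-12):
    """Inception Score from classifier outputs.

    probs: array of shape (N, K); row i holds P(y | x_i) for generated
    image x_i (assumed to come from a pre-trained classifier).
    """
    probs = np.asarray(probs, dtype=np.float64)
    # Marginal label distribution P(y), averaged over generated images.
    marginal = probs.mean(axis=0, keepdims=True)
    # KL(P(y|x) || P(y)) for every image.
    kl = np.sum(probs * (np.log(probs + eps) - np.log(marginal + eps)), axis=1)
    # Exponential of the expected KL divergence.
    return float(np.exp(kl.mean()))


num_classes = 5
# Confident and diverse: each image is assigned to a different class.
confident_diverse = np.eye(num_classes)
# Uninformative: every image receives the same uniform prediction.
uniform = np.full((num_classes, num_classes), 1.0 / num_classes)
is_high = inception_score(confident_diverse)
is_low = inception_score(uniform)
```

The two toy inputs mark the extremes: confident predictions spread evenly over all classes give a score close to the number of classes, whereas identical predictions for every image give a score of 1.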
The Fréchet Inception Distance (FID) metric [71], [161] leverages the widely used Inception network [174] to extract feature representations from both real and synthesised images. Subsequently, it quantifies the divergence between the two distributions of features by computing the Fréchet distance. Notably, a lower FID score implies that the feature distributions of the generated images are more closely aligned with those of the real images.

$$ \mathrm{FID}(P, Q) = \| \mu_P - \mu_Q \|_2^2 + \mathrm{Tr}\left( C_P + C_Q - 2 (C_P C_Q)^{1/2} \right) \tag{3} $$

The FID formula compares the similarity between two probability distributions, $P$ and $Q$, typically representing real and generated data distributions. The mean vectors, $\mu_P$ and $\mu_Q$, represent the average feature vectors of the data in each distribution, indicating their central tendencies. The covariance matrices, $C_P$ and $C_Q$, provide information about the spread or variability of the feature vectors. The squared Euclidean distance, $\| \cdot \|_2^2$, measures the dissimilarity between the mean feature vectors. The term $\mathrm{Tr}(C_P + C_Q - 2 (C_P C_Q)^{1/2})$, which involves the trace operator (sum of diagonal elements), captures the difference between the covariances of the two distributions. By combining these components, the FID formula quantifies the discrepancy between the real and generated data distributions, providing a metric for evaluating the quality of generated data.
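To make the roles of the mean and covariance terms concrete, the sketch below computes the Fréchet distance between two sets of feature vectors, each modelled as a Gaussian. It assumes the features have already been extracted (for FID, by an Inception network); here random vectors stand in for them. The function name `frechet_distance` is hypothetical. The trace of the matrix square root is obtained from the eigenvalues of the product of the two covariance matrices, which is one of several possible ways to evaluate that term.

```python
import numpy as np


def frechet_distance(feats_p, feats_q):
    """Frechet distance between Gaussians fitted to two feature sets.

    feats_p, feats_q: arrays of shape (N, d) with d >= 2 (pre-extracted
    features; the feature extractor itself is outside this sketch).
    """
    feats_p = np.asarray(feats_p, dtype=np.float64)
    feats_q = np.asarray(feats_q, dtype=np.float64)
    mu_p, mu_q = feats_p.mean(axis=0), feats_q.mean(axis=0)
    c_p = np.cov(feats_p, rowvar=False)
    c_q = np.cov(feats_q, rowvar=False)
    # Tr((C_P C_Q)^{1/2}) equals the sum of the square roots of the
    # eigenvalues of C_P C_Q, which are real and non-negative for
    # covariance matrices; clip guards against round-off.
    eigvals = np.linalg.eigvals(c_p @ c_q)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_p - mu_q
    return float(diff @ diff + np.trace(c_p) + np.trace(c_q) - 2.0 * trace_sqrt)


rng = np.random.default_rng(0)
real = rng.normal(size=(200, 4))            # stand-in for real features
shifted = real + np.array([1.0, 2.0, 0.0, 0.0])  # same spread, moved mean
d_same = frechet_distance(real, real)
d_shifted = frechet_distance(real, shifted)
```

When the two sets differ only by a constant shift, the covariance term vanishes and the distance reduces to the squared length of the shift, which makes the example easy to check by hand.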
There are two types of Video Fréchet Inception Distance (VFID) that can be utilised to assess the quality of synthesised videos, namely VFID I3D [21] and VFID ResNeXt101 [68]. Both I3D and ResNeXt101 are deep learning models that have achieved state-of-the-art performance in video object detection and action recognition tasks. They extract temporal and spatial features from real and synthesised videos, and the resulting feature distributions are compared using Eq. 3 to determine the performance scores.
The Learned Perceptual Image Patch Similarity (LPIPS) [203] metric employs a pre-trained deep neural network that has been fine-tuned to assess the perceptual similarity between images. The network is trained to capture human perception in the context of image quality. To determine the perceptual distance between two images, LPIPS calculates the dissimilarity between their respective feature maps of deep convolutional networks across multiple spatial scales and computes the average of these values to yield an overall score. A lower LPIPS score indicates that the generated images exhibit higher perceptual similarity to the real images.
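The aggregation step described above can be sketched independently of any particular network. The code below is an LPIPS-style distance, not the reference implementation: it assumes the feature maps of each layer have already been extracted, normalises them along the channel axis, and averages the squared differences over spatial positions before summing across layers. The function name `lpips_style_distance` is hypothetical, and the uniform channel weights used by default are an assumption standing in for the learned weights of the actual metric.

```python
import numpy as np


def lpips_style_distance(feats_a, feats_b, weights=None, eps=1e-10):
    """LPIPS-style distance from pre-extracted feature maps.

    feats_a, feats_b: lists of arrays, one per layer, each of shape (C, H, W).
    weights: optional list of per-channel weights, one array of shape (C,)
             per layer; uniform weights are assumed when omitted.
    """
    total = 0.0
    for i, (fa, fb) in enumerate(zip(feats_a, feats_b)):
        fa = np.asarray(fa, dtype=np.float64)
        fb = np.asarray(fb, dtype=np.float64)
        # Unit-normalise the feature vector at every spatial position.
        fa = fa / (np.linalg.norm(fa, axis=0, keepdims=True) + eps)
        fb = fb / (np.linalg.norm(fb, axis=0, keepdims=True) + eps)
        if weights is None:
            w = np.ones(fa.shape[0])
        else:
            w = np.asarray(weights[i], dtype=np.float64)
        # Weighted squared difference, summed over channels,
        # averaged over spatial positions.
        sq_diff = w[:, None, None] * (fa - fb) ** 2
        total += sq_diff.sum(axis=0).mean()
    return float(total)


rng = np.random.default_rng(0)
# Random stand-ins for two layers of features at different spatial scales.
feats_real = [rng.normal(size=(8, 16, 12)), rng.normal(size=(16, 8, 6))]
feats_fake = [f + rng.normal(scale=0.5, size=f.shape) for f in feats_real]
dist_same = lpips_style_distance(feats_real, feats_real)
dist_diff = lpips_style_distance(feats_real, feats_fake)
```

In practice, a maintained implementation with its learned weights should be preferred for reporting scores; the sketch only shows why identical features give a distance of zero and why the score grows as features diverge.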
Metrics such as IS and FID may not always accurately assess the output of a generative model [16], [31]. The sample size used to calculate FID should be sufficiently large; otherwise, smaller sample sizes can lead to an overestimation of the actual FID [31]. Also, IS and FID may not be ideal for diagnosing specific issues related to diversity [16]. SSIM was initially designed to evaluate the video compression ability of a model, and it is not a suitable metric for assessing the model's distortion performance, such as garment warping [129], [137]. To mitigate these issues, researchers can combine several types of evaluation, including quantitative comparisons, qualitative comparisons, and user studies, to gain a clearer understanding of how their model performs.
In TABLES 6, 7 and 8, we have used FID, IS, LPIPS, SSIM, VFID I3D, and VFID ResNeXt101 as evaluation metrics to compare various models. These are the most commonly used metrics in the field. However, researchers could have used more advanced metrics, such as MS-SSIM [185] or KID [12], which could have provided a more accurate depiction of their model's performance. Due to the lack of consistency in reporting these metrics by different researchers, comparing the results has become more challenging.

E. COMPARISON
In this section, we aim to demonstrate the quantitative comparison made amongst virtual try-on models, assessed in terms of the quality of the synthesised try-on content. TABLES 6, 7 and 8 present published results of all categories of virtual try-on models, showcasing the scores achieved across various metrics.
TABLE 6 shows the results reported by researchers from different datasets. Most of these scores were evaluated on the VITON dataset [67]. The mean FID score for each year is as follows: 21.957 in 2018, 14.922 in 2020, 12.925 in 2021, and 11.563 in 2022. Furthermore, there is one model published in 2023 that scored 8.82. This suggests that every year, the FID score decreases, which indicates that the image quality gradually improves and these models are making advancements. The increase in the SSIM score over the years indicates promising improvements. The mean SSIM score for each year is as follows: 0.761 in 2018, 0.81 in 2020, 0.852 in 2021, and 0.893 in 2022. One model published in 2023 has achieved an impressive SSIM score of 0.918. The highest performing models for the FID metric in each year are ACGPN [195] in 2020, DP-VTON [22] in 2021, and VTON-SCFA [48] in 2022. As for SSIM, the highest performers are ACGPN [195] in 2020, Zflow [33] in 2021, and ST-VTON [32] in 2022.
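The year-on-year trend in the VITON means quoted above can be checked with a short calculation. The sketch below uses only the yearly mean FID and SSIM values stated in this paragraph; the function name `relative_change` is hypothetical, and the single-model 2023 scores are left out because they are not yearly means.

```python
# Yearly mean scores on VITON as quoted in the text.
fid_by_year = {2018: 21.957, 2020: 14.922, 2021: 12.925, 2022: 11.563}
ssim_by_year = {2018: 0.761, 2020: 0.81, 2021: 0.852, 2022: 0.893}


def relative_change(series):
    """Relative change between consecutive reported years.

    Returns {(earlier_year, later_year): (later - earlier) / earlier}.
    """
    years = sorted(series)
    return {
        (a, b): (series[b] - series[a]) / series[a]
        for a, b in zip(years, years[1:])
    }


fid_change = relative_change(fid_by_year)
ssim_change = relative_change(ssim_by_year)

for (a, b), change in fid_change.items():
    print(f"FID  {a} -> {b}: {change:+.1%}")
for (a, b), change in ssim_change.items():
    print(f"SSIM {a} -> {b}: {change:+.1%}")
```

Every FID change is negative and every SSIM change is positive, which matches the direction of improvement described in the text; the size of each step should be read with care, since the means are taken over different sets of models in each year.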
Modern image-based virtual try-on models are evaluated on the VITON-HD dataset [30] due to its higher resolution and better quality. The mean FID score for each year is 13.645 in 2021, 10.91 in 2022 (based on a single model), and 8.606 in 2023. The decreasing FID score indicates an improvement in the performance of the models. Similarly, the mean SSIM score for each year is 0.8525 in 2021, 0.892 in 2022 (based on a single model), and 0.883 in 2023. Although the 2023 mean is slightly below the single 2022 score, SSIM has risen overall since 2021, which suggests that the models are getting better at synthesising virtual try-on images. The highest performing models for the FID metric in each year are VITON-HD [30] in 2021 and GC-VTON [150] in 2023. As for SSIM, the highest performers are VITON-HD [30] in 2021 and GP-VTON [188].
TABLE 7 displays the quantitative performance of multi-pose virtual try-on models that have been reported by various researchers from two datasets. Most of these scores were evaluated on the MPV dataset [41]. However, when it comes to examining the metrics, it is challenging to identify whether multi-pose virtual try-on models are making any quantitative improvements. This is mainly due to the lack of adequate results to draw a conclusion. That being said, the models that performed best in the FID metric are SS-VTON [69] for 2022 and VTON-MP [198] for 2023. The models that performed best in the SSIM metric are MV-TON [49] for 2022 and CF-VTON [46] for 2023.
TABLE 8 presents the quantitative evaluation of video virtual try-on models. It is worth mentioning that research in this category is not as active as it is with image-based and multi-pose virtual try-on models. Due to the limited number of scores, it is not possible to draw a conclusion about whether newer models are making actual improvements. The ClothFormer [89] model performed the best out of all models based on all metrics.

V. IMPACTS AND APPLICATIONS
In this section, we will scrutinise the influence of virtual try-on models on both customers and businesses. Our exploration will encompass the assessment of hedonic, utilitarian, perceived risk, and social values associated with virtual try-on technology, providing insights into its profound effects on customer satisfaction [51], [56]. Additionally, we will delve into the strategic advantages of this technology, explaining how it augments sales and optimises conversion rates.

A. HEDONIC VALUE
The concept of hedonic value pertains to the enjoyment and satisfaction customers feel when they engage in a certain task [8]. Shopping, whether in physical stores [8], [14] or online [29], [75], provides a significant source of pleasure and motivation for customers, which leads to a positive shopping experience. Several studies have demonstrated that using virtual try-on technology results in a positive hedonic experience for customers [28], [101], [107].
In the fashion industry, customers have a strong desire for exceptional shopping experiences [13], and virtual try-ons allow it to happen. If customers' shopping expectations are not met, they may choose to engage in other leisure activities. Therefore, businesses need to ensure that their virtual try-on models are enjoyable to use [108]. This claim is also supported by the technology acceptance model (TAM) [38], which states that the adoption of technology is influenced by its perceived hedonic value.
TABLE 6. Performance comparison of image-based virtual try-on models. Higher values indicate better results for SSIM and IS, while lower values are desirable for FID and LPIPS. Some scores have been reported in papers other than the original publication: α: SVTON [83]; β: DCTON [57]; γ: ST-VTON [32]; δ: VTON-HF [47].
TABLE 7. Performance comparison of multi-pose virtual try-on models. Higher values indicate better results for SSIM and IS, while lower values are desirable for FID and LPIPS. Some scores have been reported in papers other than the original publication: α: TB-VTON [182]; β: CF-VTON [46].
TABLE 8. Performance comparison of video virtual try-on models. Higher values indicate better results for SSIM, while lower values are desirable for VFID I3D, VFID ResNeXt101, and LPIPS. Some scores have been reported in papers other than the original publication: α: ClothFormer [89].
Customers who perceive that their body is relatable to the virtual model are more likely to enjoy using virtual try-ons, leading to increased hedonic value [128]. This phenomenon is known as self-congruity, which refers to the tendency to compare oneself with other objects and stimuli [120].

B. UTILITARIAN VALUE
The utilitarian value of a tool or technology refers to how useful customers perceive it to be and how effective they believe it will be in helping them achieve their goals [180]. Virtual try-on technology, for example, enables customers to quickly and conveniently try on numerous fashion items from any location, helping them to assess the size and fit of apparel [86]. Traditionally, customers need to physically see, feel, touch and try on clothing products before they can make a purchase [35]. This is difficult in the online realm; however, virtual try-on technology has the potential to reduce the strain.
Many studies have shown that customers found virtual try-on models to have a high utilitarian value [10], [28], [101], [108]. They praised the ability to try on various fashion products and assess how well the items complemented their skin tone, hair colour, and other personal attributes. Furthermore, virtual try-ons' utilitarian values are significantly influenced by customer motivation, potentially affecting adoption intention [108].
The customers' perception of the utilitarian value of virtual try-on models is a critical factor in determining their acceptance and usage [38], [128]. Virtual try-on models have to be user-friendly and straightforward to navigate; otherwise, individuals will be reluctant to utilise them, which diminishes their utilitarian value [128].
Another important factor in increasing the utilitarian value of virtual try-ons is self-congruity. If customers perceive that their bodies are compatible or similar to the virtual models on the website, they are more likely to assume that the way a clothing product fits a virtual model will fit them in real life [128]. It is worth mentioning that customers with higher body esteem tend to perceive the virtual model as more self-congruent. This suggests that a virtual try-on model's utilitarian value can be influenced by the customer's perception of their own body [128].
It is important to note that not all customers find virtual try-on models to be very functional. In particular, those who have a low degree of self-congruity with virtual models may not see the value in using a virtual try-on model [128]. Additionally, some customers may be sceptical about how accurately the model can predict how well a garment will fit, which also reduces its utilitarian value [101], [204].

C. PERCEIVED RISK
It is possible that some customers may be hesitant to use virtual try-on technology due to the chance that the virtual model's fit may not accurately reflect their own (meaning the clothing could appear to fit well on the model but not on the customer in real life) [101], [204]. The extent to which customers perceive risk plays a significant role in their willingness to use virtual try-ons [204].
When using virtual try-ons, customers may have concerns about the safety of their personal information, such as facial image, height, weight, bust size, waist size, and body shape. This information is transferred to other parties and may be at risk of being leaked, which can create security risks [105], [204]. Many mature users are hesitant to input sensitive information online, making them less likely to use tools that require it [105].
The level of perceived risk affects the connection between the utilitarian value and customers' overall opinion about virtual try-on technology [28]. Therefore, it is essential for businesses to help customers build confidence in using virtual try-on models to overcome any perceived risks.
Age is an important factor in determining the level of risk perception for customers. Minors (under the age of 18) are more interested in trying out new technologies like virtual try-ons and perceive these tools to be less risky compared to adults [201], [204]. Minors may be more eager to use virtual try-ons due to their lower perceived risk or lack of concern about sharing personal information to interact with the technology [119].

D. SOCIAL VALUE
In the context of online shopping, social value refers to how a product affects a customer's perception of themselves and their status to others. Customers evaluate not only the hedonic or utilitarian value of the product or service but also how it influences their image to others [173]. Virtual try-on models can help customers evaluate the social value of clothing products by allowing them to see how they look on themselves and consider how others would perceive them.
Having a positive body image makes customers more inclined to share pictures of themselves, including virtual try-on images, on social media [52]. For this reason, body esteem significantly influences the perceived social value that virtual try-ons can provide [86]. When customers feel confident in their own bodies, they can use the try-on result to assess how a clothing item looks on them and how others perceive it.
Virtual try-on technology allows customers to generate and share previews of themselves wearing trendy clothing, increasing social value [86], [93]. Some customers prefer socially connected online businesses, that is, businesses that make it practical to share content with friends and family [39], which makes virtual try-on a valuable asset.

E. MULTI-GROUP DIFFERENCES
There are a few studies that have investigated differences in attitudes among various social identity categories towards virtual fitting rooms. There are no notable variations between genders when it comes to their preference towards virtual try-on systems [101], [204]. This means that all genders are equally likely to utilise virtual fitting room tools in the same manner. Customers under the age of 18 exhibit a more positive attitude towards virtual fitting tools compared to those above the age of 18 [204].
In a utilitarian sense, male interviewees indicated that virtual try-on models would be useful for buying suits and jeans, whereas female interviewees believed that models would be helpful when shopping for underwear, bathing suits, or dresses, as it allowed them to see how the item would look on a body [101].

F. IMPACT ON BUSINESS
The use of deep learning/generative AI models in the business field is a relatively new concept and has not yet been extensively explored in academic research. As a result, there is limited information on the impact of these models on business activities, outcomes and financial performance [102]. However, recent studies have suggested that AI tools can help businesses provide personalised, timely and relevant content to customers, resulting in greater customer satisfaction [18]. Additionally, the use of deep learning and generative AI models can enable businesses to perform tasks at a lower cost and more effectively than human teams and even tackle tasks that are impossible for humans to perform [102], [103]. Virtual try-on models, in particular, have many benefits for businesses, such as promoting personalised marketing and performing tasks that are not possible for humans.
Virtual try-on technology has been shown to have several positive outcomes for businesses. These outcomes include increased visitor numbers, longer time spent on the site, higher spending, and greater sales, as well as an increase in the conversion rate of customers [101]. With virtual try-on models in place, the return rate was observed to decrease significantly [81], and customers were willing to spend more [59]. Research has indicated that the adoption of virtual try-on technology for online apparel shopping can significantly reduce the risk regarding apparel fit perceived by consumers when shopping online while also increasing the enjoyment of the shopping experience [99]. The perceived entertainment value of virtual try-on technology has been found to have a positive influence on attitudes towards the technology, as consumers tend to enjoy immersing themselves in the virtual simulation [204]. Additionally, a study focusing on the US department store industry during the pandemic highlighted the success of virtual try-on technology as an innovative strategy, with potential for success in the post-pandemic world [183].
Moreover, offering virtual fit information has been shown to increase conversion rates and order value [7], while reducing fulfilment costs arising from returns and home try-on behaviour [55]. It is also suggested that the direction of technology acceptance model-related research should be drawn by the functional or hedonic purpose of the technology/system [100]. The effectiveness of 3D virtual fitting technology has been investigated, showing its potential to create well-fitting clothes efficiently, particularly for tailored suits and jackets [169]. Furthermore, the application of virtual try-on technology has been explored in the context of reducing consumers' perceived risk about apparel fit, which can positively impact their purchasing experiences [165].

VI. CHALLENGES AND FUTURE DIRECTIONS
In this section, we will delve into the obstacles that virtual try-on models encounter. These challenges include issues related to performance, accuracy, and user experience. Additionally, we will examine the latest advancements in virtual try-on technology and discuss potential future directions for research and development in this field.

A. ACCURACY
Virtual try-on models can sometimes inaccurately deform the garment, resulting in the item appearing as short-sleeved instead of long-sleeved and not aligning with the person's body properly. This is because the segmentation module responsible for predicting the semantic layout is not always accurate [82], [85]. The garment is distorted based on the predicted semantic layout, so when the prediction is incorrect, the garment will also be distorted inaccurately.
FIGURE 10. Weaknesses presented by virtual try-on models. (a) Some models find it difficult to accurately apply clothing to a person. The first [195] row shows the sleeve has been ruined, and the second [181] row depicts the clothing did not align with the body properly. The third [69] row shows the body shape has inaccurately shrunken in the synthesised image. (b) Capturing and merging content from conditional data is complex. The first [42] row demonstrates that the model was unable to capture logos, patterns, and textures accurately. Occlusion handling is also a significant challenge and can lead to synthesising low-quality images, as shown in the second [181] row. Even the diffusion model can struggle to capture text and logo from a conditional image, as seen in the third [133] row. (c) Preserving facial identity is extremely challenging for multi-pose and video virtual try-on models. The first [41] and third [182] rows are multi-pose models that did not preserve the facial identity, while the second row [42] is a video model that also failed to preserve the facial identity.
A limitation of multi-pose virtual try-on models is that they may not accurately preserve the body size of the input image when generating a new try-on image in a different posture [69]. This means that if the person in the original image is wider, the model may generate a try-on image that makes them appear skinnier. He et al. [69] argue that datasets like MPV [41] do not include enough people with diverse body shapes, which limits the model's ability to capture the body shape of the input image.
FIGURE 10a illustrates how the models [181], [195] failed to maintain the shape of the clothing and align it correctly on the person, resulting in inaccurate images. Multi-pose virtual try-on does not maintain the same level of accuracy as single-pose models. All elements have to be transferred to the desired pose, which can lead to inaccuracies in synthesising images that do not complement the candidate's body shape. This is demonstrated in the third row of FIGURE 10a.

B. QUALITY
Virtual try-on models sometimes produce low-quality content. For instance, they may fail to accurately preserve the logo or capture textures of the target garment. Occlusion handling is another issue that can degrade the quality of virtual try-on images. For example, if a person crosses their arms, some models may struggle to handle this situation and end up producing unrealistic try-on images. In some cases, the image generator may simply not be capable enough to handle certain pairs of images. To address such issues, more recent work has attempted to use more complex loss functions or add more powerful components to the generator [53], [54], [111].
In FIGURE 10b, it is evident that some models have generated low-quality virtual try-on images. The first row [42] shows that the model failed to incorporate the bike that appears in the target garment in the synthesised image. The second row [181] displays poor occlusion handling, as the exposed arms have unnaturally blended behind the garment. The last row [133] displays a result generated using a diffusion model. While diffusion models are known to be superior to GANs [40], they still have limitations. The result shows that the model did not preserve the logo adequately, as the text is not readable.

C. FACIAL PRESERVATION
Image-based virtual try-on models have an easier task because they can copy the candidate's face and place it in the desired position without requiring much additional alteration. However, multi-pose and some video virtual try-on models face a more difficult challenge of maintaining the candidate's facial identity. This is because the face must be transferred to a new position while preserving its unique characteristics. Accomplishing this requires the generator to handle significant spatial deformation, which is an extremely challenging task [181]. These models ultimately generate facial images that do not resemble the person.
For example, FIGURE 10c shows the outcomes generated by multi-pose and video virtual try-on models [41], [42], [182]. These models do not preserve the faces properly and produce a generic face that only captures the skin colour, hair colour, and length.

D. DATASETS
It is essential to include a wider and more diverse range of images in virtual try-on model datasets. The current datasets mainly consist of adult women who are mostly Caucasian and wearing Western-style clothing, which underrepresents other ethnic groups. The datasets contain fewer Asian and African people, which can lead to biased models and poor performance for customers from those groups. Moreover, underprivileged countries with underrepresented cultures will face limited accuracy and reduced compatibility when trying on cultural clothing. Therefore, it is necessary to incorporate a more diverse range of images in the datasets to ensure fair and equal representation for all customers.
Non-diverse datasets can also cause models to exhibit bias against other underrepresented groups, for example across genders, body types, ages, disabilities, and locations. This can negatively affect the popularity of virtual try-on models. Research shows that there is no significant difference in the attitude towards virtual try-ons between males and females [101], [204] and that minors (under 18) are more likely to engage with and utilise virtual try-on systems than adults [204]. Therefore, it is crucial to include as many diverse groups as possible in the dataset to make virtual try-ons applicable to everyone.
Researchers assume that the test set has the same distribution as the training set, which does not reflect the real-world scenario [111]. In reality, people can take pictures or videos from various angles, or capture images that do not show the full body, which can affect the performance of all categories of virtual try-ons. Therefore, it is essential to consider real-world scenarios when developing and evaluating these tools.
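One simple way to probe this gap is to perturb a clean test set so that it looks less like the training data. The sketch below is illustrative only and is not taken from any surveyed work: the function names are hypothetical, and the two perturbations (cutting off the lower part of the image, and rotating by 90 degrees as a crude stand-in for an unusual camera angle) are assumptions about what "real-world" inputs might look like.

```python
import numpy as np


def crop_partial_body(image, keep_fraction=0.6):
    """Keep only the top part of an (H, W, C) image, as if the legs were cut off."""
    height = image.shape[0]
    keep = max(1, int(round(height * keep_fraction)))
    return image[:keep]


def rotate_quarter_turn(image):
    """Rotate an (H, W, C) image by 90 degrees (crude proxy for an odd camera angle)."""
    return np.rot90(image, k=1, axes=(0, 1))


def make_shifted_test_set(images, keep_fraction=0.6):
    """Return perturbed copies of each test image; the originals are left untouched."""
    shifted = []
    for image in images:
        shifted.append(crop_partial_body(image, keep_fraction))
        shifted.append(rotate_quarter_turn(image))
    return shifted


# Toy example: a single 10x4 RGB "image" filled with consecutive integers.
images = [np.arange(10 * 4 * 3).reshape(10, 4, 3)]
shifted = make_shifted_test_set(images)
```

Comparing a model's scores on the original and the perturbed images gives a rough indication of how much it relies on the clean, full-body framing of the benchmark.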

E. ECONOMIC IMPACT
Several companies have implemented virtual try-on models on their platforms. These models allow customers to experiment with products virtually. Some examples of such companies are Ray-Ban, L'Oreal Paris, and Hugo Boss [1], [2], [3]. However, these companies have not provided any information regarding the benefits for themselves or their customers, or whether this tool has had an economic impact. Researchers have pointed out the lack of reported outcomes and financial performance when employing deep learning/generative AI models, as they are a new concept in the business domain [102]. Therefore, research is needed to determine whether virtual try-on models have a financial impact on businesses. Currently, there are no clear studies exploring this topic.

F. EMOTIONAL CHALLENGES
Virtual try-on models are designed to address the uncertainty that customers may have when purchasing clothing online without the ability to physically try it on. However, some customers may be hesitant to trust the accuracy of these models, fearing that they may not accurately reflect the real-life fit or look of the product [101], [204]. This presents a challenge for researchers, who must work to find a solution that will increase customer confidence in the accuracy of virtual try-on models and demonstrate that they effectively reduce the uncertainty associated with online clothing shopping. The development progress of virtual try-on models is significant, but research is needed to establish whether these models perform well enough to earn customers' trust in their accuracy, and whether customers are willing to use them when shopping for clothes online.

G. CODE AVAILABILITY
We have discussed several models in TABLES 2, 3, and 4, but only a few of these models have shared their code publicly. Implementing models developed by others can be challenging due to unclear implementation details, and results may vary from those reported by the original authors. This poses a problem as it reduces the scope for reproducibility and makes it difficult for other researchers to compare their work with these models.

H. FUTURE DIRECTIONS
The image-based virtual try-on category is the most competitive and popular of the three categories. Every year, new studies are published showing how image-based virtual try-ons have surpassed previous work. This trend is shown in TABLE 2, where researchers keep making progress by finding weaknesses in prior work and proposing solutions. Researchers are also moving away from GANs and towards more powerful generative models such as diffusion models [19], [63], [94], [118], [211].
Researchers have proposed various approaches to synthesise multi-pose try-on images, as shown in TABLE 3. These approaches generally consist of three stages: first, generating a semantic map of the target pose; second, warping the garment; and finally, re-rendering the person in the target pose while fusing with the warped garment. An exception is the single-stage model [69], which accomplishes all these steps in a single network. The three-stage trend may continue in future developments. As with single-pose virtual try-on, researchers may use diffusion models to synthesise multi-pose virtual try-on images.
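The data flow of the three-stage design can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of any model in TABLE 3: the function names, the `GARMENT_LABEL` class id, and the mask-based fusion are all hypothetical, and the three `toy_*` stages are trivial stand-ins for what would be learned networks in practice.

```python
import numpy as np

GARMENT_LABEL = 1  # hypothetical class id for the garment region


def toy_parse(person, target_pose):
    """Stage 1 stand-in: label a central box of the image as the garment region."""
    height, width = person.shape[:2]
    semantic_map = np.zeros((height, width), dtype=np.int64)
    semantic_map[height // 4 : 3 * height // 4, width // 4 : 3 * width // 4] = GARMENT_LABEL
    return semantic_map


def toy_warp(garment, semantic_map):
    """Stage 2 stand-in: leave the garment unchanged and derive a mask from the map."""
    mask = (semantic_map == GARMENT_LABEL).astype(np.float32)
    return garment, mask


def toy_render(person, target_pose, semantic_map):
    """Stage 3 stand-in: return the person unchanged instead of re-posing them."""
    return person


def three_stage_try_on(person, garment, target_pose,
                       parse_fn=toy_parse, warp_fn=toy_warp, render_fn=toy_render):
    """Chain the three stages and fuse the warped garment with the rendered person."""
    semantic_map = parse_fn(person, target_pose)              # stage 1: semantic map
    warped_garment, mask = warp_fn(garment, semantic_map)     # stage 2: garment warping
    rendered = render_fn(person, target_pose, semantic_map)   # stage 3: re-rendering
    mask = mask[..., None]                                    # broadcast over colour channels
    return mask * warped_garment + (1.0 - mask) * rendered    # fusion (assumed: mask blending)


# Toy inputs: a black "person", a white "garment", and an 18-channel pose tensor
# (the channel count is an arbitrary choice for this example).
person = np.zeros((8, 8, 3), dtype=np.float32)
garment = np.ones((8, 8, 3), dtype=np.float32)
target_pose = np.zeros((8, 8, 18), dtype=np.float32)
result = three_stage_try_on(person, garment, target_pose)
```

Passing the stages in as functions keeps the sketch agnostic about how each one is implemented; a single-stage model would instead replace all three calls with one network.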
The development of video virtual try-on is progressing towards the use of attention mechanisms [179] and transformer models [179], as demonstrated in TABLE 4. These methods have exhibited promising results and enabled models to produce impressive outcomes. It is possible that future researchers will use diffusion models to synthesise virtual try-on videos, because there is active research on diffusion models for video generation [60], [73], [167], [186].
Virtual try-ons have been introduced as an alternative to physical fitting rooms, but not many customers have embraced this technology yet [86], [114]. This could be because many customers or businesses are not aware of the existence of virtual try-on models. However, with the advancements in technology, improvements in the accuracy of virtual try-ons, and gradual recognition by the masses, it is possible that the adoption of this technology may increase over time.

VII. CONCLUSION
We have reviewed deep learning based virtual try-on models and grouped them into three categories based on their functionality: image-based, multi-pose, and video. For each of these categories, we have provided comprehensive examples of models and summarised their technical details and contributions. We have also identified similarities in the methods used by researchers.
Furthermore, we have examined the datasets used by these models, including the number of images/videos and their resolutions. We have also observed that researchers tend to conduct qualitative comparisons by comparing their synthesised images with previous work. Additionally, they perform quantitative evaluations across various metrics and benchmark datasets.
We have discussed weaknesses of deep learning based virtual try-on models, such as being unable to preserve clothing characteristics and textures and sometimes having difficulty applying the clothing to the person. Furthermore, we have discussed dataset bias: the datasets mainly consist of images featuring women posing against a white background, with limited clothing diversity. This can negatively impact the model's ability to handle less common garment types.
Our research has revealed that a number of enterprises have already integrated virtual fitting rooms into their platforms. These tools are reported to offer a variety of benefits, which we anticipate will encourage more companies to adopt this technology in order to enhance the decision-making process for their customers and provide a positive experience for all involved.
Finally, we have examined how virtual try-ons affect the attributes and factors that lead to customer satisfaction. We have shown that researchers highlight the benefits that customers can enjoy, which can also reduce returns and optimise conversion rates for businesses.

FIGURE 6. Variants of pose representations used by virtual try-on models. Multi-channel heatmaps represent a single key joint in every channel, whereas the RGB skeleton shows how the joints are connected together. DensePose maps image pixels to body-surface coordinates, from which texture can be extracted and further manipulated.
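The multi-channel heatmap encoding described in this caption can be sketched as follows. This is an illustrative example only: the function name, the Gaussian shape of each blob, the `sigma` value, and the convention that negative coordinates mark an undetected joint are all assumptions, and individual models may encode keypoints differently.

```python
import numpy as np


def keypoints_to_heatmaps(keypoints, height, width, sigma=2.0):
    """Encode (x, y) keypoints as an (height, width, num_joints) array, one joint per channel."""
    ys, xs = np.mgrid[0:height, 0:width]
    heatmaps = np.zeros((height, width, len(keypoints)), dtype=np.float32)
    for joint, (x, y) in enumerate(keypoints):
        if x < 0 or y < 0:
            # Assumed convention: negative coordinates mean the joint was not detected,
            # so its channel stays empty.
            continue
        squared_distance = (xs - x) ** 2 + (ys - y) ** 2
        heatmaps[..., joint] = np.exp(-squared_distance / (2.0 * sigma ** 2))
    return heatmaps


# Three joints on a 16x16 grid; the last one is marked as undetected.
keypoints = [(4, 2), (10, 12), (-1, -1)]
heatmaps = keypoints_to_heatmaps(keypoints, height=16, width=16)
```

Each channel peaks at the location of its own joint, so the array can be passed to a network alongside the image as additional input channels.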

FIGURE 7. Examples of multi-pose virtual try-on images illustrated by He et al. [69]. The aim of this model is to synthesise a try-on image and change the posture simultaneously.

FIGURE 9. A qualitative comparison carried out by C-VTON [54]. This is a common approach researchers use to demonstrate their model's superiority over previous work on the same person-garment pair.

TABLE 1. Summary of related virtual try-on surveys.

TABLE 2. List of single-pose virtual try-on models.

TABLE 3. List of multi-pose virtual try-on models.

TABLE 4. List of video virtual try-on models.

TABLE 5. Common datasets used by virtual try-on models.