Deep Learning Approaches for Fashion Knowledge Extraction From Social Media: A Review

Fashion knowledge helps people dress properly and addresses not only the physiological needs of users but also the requirements of social practices and activities. It usually involves three closely related aspects: occasion, person and clothing. Nowadays, social media platforms allow users to interact with each other online to share opinions and information. The use of social media sites such as Instagram has spread to almost every fashion brand, and these sites are regarded as tools for business take-off. With the heightened use of social media as a means of marketing communication for fashion brands, it has become necessary to empirically analyse and extract fashion knowledge from them. Brands are therefore investing in social media, so that they can understand consumers' preferences. This change is also having a significant impact on social media data analysis. For this kind of analysis, Deep Learning (DL) methods have proven to be effective solutions thanks to their automatic learning capability. However, little systematic work currently exists on how researchers have applied DL to analyse fashion knowledge from social media data. Hence, this contribution outlines DL-based techniques for social media data related to the fashion domain. The study reviews the datasets available in the fashion world and the DL methods applied to them, in order to help new researchers interested in this subject. In particular, five different tasks are considered: Object Detection (which includes Clothes Landmark Detection, Clothes Parsing and Product Retrieval), Fashion Classification, Clothes Generation, Automatic Fashion Knowledge Extraction and Clothes Recommendation. The purpose of this paper is therefore to underline the multiple applications of deep learning techniques within the fashion world. However, this review does not cover all the methods used: only Deep Learning methods have been analyzed. This choice was made because, given the huge amount of fashion social media data that has been collected, Deep Learning methods achieve the best performance in terms of both accuracy and time. Limitations point towards unexplored areas for future investigations, serving as useful guidelines for future research directions.


I. INTRODUCTION
Online social networks are part of every person's life. More than half of the world's population is connected to the internet and has at least one social platform. According to the January 2021 report by We Are Social, there are 7.83 billion people in the world, 66.6% of whom have a mobile phone. 4.66 billion people access the internet, an increase of 7.3% compared to January 2020. World internet penetration stands at 59.5%, but the values could be even higher because of problems in correctly tracking internet users during the COVID-19 pandemic. There are 4.20 billion users of social platforms, an increase of 13%. The use of social platforms therefore stands at 53% of the world population.
The associate editor coordinating the review of this manuscript and approving it for publication was Arianna Dulizia.
In particular, social networks have long since changed the way we communicate and perceive the world: it is therefore no coincidence that fashion, of which communication and perception are two fundamental pillars, is an integral part of this revolution. In fact, the fashion industry is one of the most dynamic in society, and in this context social media are fundamental communication tools, in particular Facebook (launched in 2004), Instagram (launched in 2010) and Tik Tok (launched in 2018). Facebook is, to date, one of the most used social networks in the world, with over 2 billion active users. Many fashion brands are present on Facebook with a company page. The primary goal is to attract new customers and retain existing ones. A strategically managed Facebook page with carefully published content makes a brand more attractive and engages an increasing number of users.
VOLUME 10, 2022. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Instagram was born in 2010 and one of the strengths of this social network is the communicative power of the images that are able to convey the identity of a brand. Tik Tok was born in 2018 and it is a platform where users can express their creativity to the maximum through short videos between 15 and 60 seconds, with background music of all kinds.
The main social reference for the fashion domain is Instagram. However, leading fashion brands have proven the power of social media marketing across multiple channels. Each channel offers different features, giving new ways to achieve goals. Facebook is the most used social media platform in the world, with more than 2 billion monthly active users. In addition to regular Facebook posts, fashion app marketers can use the platform for live broadcasts.
Instagram has a global audience of 500 million daily active users, who collectively tap the platform's "Like" button 4.2 billion times every day. Ecommerce brands can also use Instagram's shopping features, which allow users to purchase items without leaving the app. Instagram offers several ways to connect with an audience, including Posts, Reels, Stories, Highlights and IGTV. Many users use Pinterest as a discovery platform to identify and refine their style: 53 percent of users say Pinterest has helped them make a fashion-related purchase decision. Twitter has 330 million monthly active users tweeting 500 million times per day. The platform can also be used to successfully promote fashion brands: Fashionista.com (2.1 million followers), Zara (1.3 million followers) and Misguided (466k followers) have all built sizable audiences by harnessing Twitter's potential for virality. For example, Fashionista.com has captured the zeitgeist by discussing one of Netflix's most recent (and fashion-centric) shows, Emily in Paris. Snapchat reported 238 million daily active users in 2020. In 2021, Ralph Lauren collaborated with Snapchat to create a virtual reality experience for its users. This enabled Snapchat users to style their personal avatars in Ralph Lauren items, which are also shoppable.
Fashion brands have started to invest a large part of their budget on Instagram in particular, as it allows them to publish very accurate and creative images, similar to professional photographs. Instagram has also strengthened the influencer phenomenon. As the word suggests, an influencer is a famous person who can influence public opinion and is therefore an important target for advertising messages, since they accelerate the acceptance of those messages by a wider audience. Since these figures inspire confidence, fashion houses have an incentive to invest money and resources in this type of strategy. In fact, as reported by the fashion marketing site MuseFind, 92% of consumers consider an influencer campaign more reliable than traditional advertising with models or celebrities. For this reason, since 2016, 65% of luxury brands have chosen to collaborate with influencers for their advertising campaigns, with amazing results.
Instagram is thus turning from a social network into something more and more like an e-Commerce showcase. Fashion and luxury brands are making great efforts to keep up with the times and adapt to change, and all of them are equipping themselves with e-Commerce platforms. The purpose of these brands is to understand the preferences of new consumers and to communicate with them directly and without filters, so that they can customize their offer.
Moreover, researchers have proposed several fashion recommender systems in the literature aiming at choosing the right outfit for different occasions [1]. Companies therefore must try to analyze the information that is spontaneously generated by web users. Big Data analysis now makes it possible to predict future trends even before they explode, providing real-time information not only on the volume of sales, but also on that of online searches. More quickly identifying fabrics, styles and colors for which public interest is growing allows us to satisfy the request in a timely manner and consequently to sell more.
For this reason, interest in applying artificial intelligence (AI) algorithms to Big Data, in particular those based on Deep Learning (DL), a subset of Machine Learning (ML), has been growing steadily in recent years. Thanks to ML techniques, companies operating in the fashion sector can identify patterns in data and build models that can predict future results. This helps to create a more flexible and faster supply chain and to manage inventory in an automated and intelligent way. In addition, algorithms for clothing design have been developed [2]: the aim is to provide the customer with a model capable of generating data similar to those given in input and to give advice on the most relevant products. These algorithms are therefore useful for analyzing sizable datasets and automating the recognition and classification of the proposed styles.
Considering the latest achievements in data collection and processing [3], DL is facing the worldwide challenge of, on one hand, reducing the need of manual intervention for huge datasets and, on the other, improving methods for facilitating their interpretation. To close this gap, this review aims to provide a technical overview of the advances and opportunities offered by DL for automatically processing and analysing social media data related to fashion domain.
Existing reviews explore particular approaches for analysing fashion data, generally based on Artificial Intelligence techniques, to solve a specific issue. There are several examples of well-structured systematic reviews focused on this domain [4]. One studies the impact and significance of AI across the supply chain of the fashion industry over the last decades [5], while the most recent [6] studies the impact and significance of AI in fashion e-commerce. In the context of fashion recommendation systems, an interesting and recent review is presented in [7]. The authors describe in detail the technical aspects, strengths and weaknesses of the filtering techniques. Moreover, they help researchers and practitioners in machine learning, computer vision and fashion retailing to understand the characteristics of several fashion recommendation systems. However, one limitation of that study is that it does not review the datasets that have been used in fashion recommendation, an aspect that is considered in this review. Moreover, the novelty of this work lies in its focus on social media fashion data. In fact, to the best of our knowledge, a complete review of deep learning based approaches for deducing insights from social media images is not present in the literature. This work presents a thorough survey of the state of the art related to the use of social media fashion data and the associated tasks, with a particular focus on Deep Learning methods. Methods and techniques for each kind of fashion task have been analysed, the main paths have been summarised, and their contributions have been highlighted. The reviewed approaches have been categorised and compared from multiple perspectives, pointing out their advantages and disadvantages. Finally, several interesting examples of datasets have been presented.
In particular, the purposes, issues and motivations were investigated to set the following research questions (RQ):
• RQ1: To give an overview of the main tasks performed using fashion data, the question to be answered is: for what tasks is fashion data used, and how has the use of this data developed over time?
• RQ2: To explore the methodologies most used in recent years to deal with fashion data, the question is: comparing ML and DL methods, does the fashion data influence the choice of one methodology rather than another?
• RQ3: To better understand the future applications in this area, the question is: what future applications that use fashion data need to be developed and explored further?
• RQ4: To understand how companies can use information from social media, the question is: how has social media changed the marketing strategies of fashion brands?
This paper is structured as follows. Section II describes the methodology adopted to identify and select the articles for the review. Section III describes the datasets used for fashion tasks in general, and for object detection in particular. Section IV presents the deep learning methods used for object detection, classification and clothes generation. Section V describes some of the datasets and deep learning methods used for Automatic Fashion Knowledge Extraction and Clothes Recommendation from social network data. Section VI presents a discussion of the methods examined. Finally, Section VIII presents the conclusions of this work.

II. RESEARCH STRATEGY DEFINITION
In the literature, there are still no reviews that address in general terms the different research strategies, starting from the research questions, in the field of deep learning applied to fashion. Guidelines were therefore defined for finalising the review. These guidelines are motivated by the fact that deep learning approaches for social media data and fashion datasets are quite new. In particular, if we focus on generative adversarial networks (GANs) for the fashion domain (e.g. virtual try-on with GANs), the relevant papers start in 2017. This led to the exclusion of papers dated before 2007; papers from 2007 onwards were retained for the sake of completeness.
A systematic review of the literature was conducted using the PRISMA guidelines and the following electronic databases: IEEE Xplore, Scopus, ScienceDirect, CiteSeerX and SpringerLink. A set of keywords was then considered, chosen in relation to the fashion domain and on the basis of a preliminary screening of the research field. The keywords initially considered in the search were: artificial intelligence, machine learning, deep learning, neural networks, object detection, object parsing, product retrieval, classification, fashion, dataset, generative adversarial networks, social media. To get more accurate results, the keywords were aggregated. In one set of queries, the keywords deep learning and fashion were combined with methodology-related keywords; in other sets, deep learning and fashion were combined with application-related keywords. Each query produced a large number of articles, which were selected based on relevance and year of publication. Articles found to be inconsistent with the research topic or published before 2007 were removed from the list. The articles considered for review were published between 2007 and 2021. In total, 219 papers were cited, some concerning datasets, others concerning theoretical methodologies, and others concerning applications. The number of articles cited per year is shown in Figure 1.
Furthermore, in Figure 2, it is possible to see the percentage of works carried out for each task treated during the review. In other words, it represents the percentage of articles divided into the tasks that have been taken into consideration.
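The keyword aggregation and the year filter described in this section can be illustrated with a minimal sketch. The split of the keywords into a methodology group and an application group, the helper names and the sample search hits are all assumptions made for illustration; the review does not state the exact query strings that were submitted to each database.

```python
# Core terms that, according to the text, appear in every query.
CORE = ("deep learning", "fashion")

# Hypothetical grouping of the remaining keywords; the review does not
# say which keyword was used in which set of queries.
METHODOLOGY = ("neural networks", "generative adversarial networks", "machine learning")
APPLICATION = ("object detection", "product retrieval", "classification", "social media")


def build_queries(core, extra_terms):
    """Return one boolean query per extra term, AND-ed with the core terms."""
    core_part = " AND ".join(f'"{term}"' for term in core)
    return [f'{core_part} AND "{term}"' for term in extra_terms]


def in_review_window(year, first=2007, last=2021):
    """Keep only publications inside the 2007-2021 window used by the review."""
    return first <= year <= last


queries = build_queries(CORE, METHODOLOGY) + build_queries(CORE, APPLICATION)

# Hypothetical search hits as (title, year) pairs, for illustration only.
hits = [("Paper A", 2005), ("Paper B", 2017), ("Paper C", 2021)]
kept = [title for title, year in hits if in_review_window(year)]

print(queries[0])
print(kept)
```

Relevance screening, the other selection criterion mentioned above, is a manual step and is deliberately not modelled here.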

III. FASHION DATASETS
This section gives a detailed description of the datasets collected in the fashion world. From 2012 to 2020, 51 datasets were built, which are divided by task and by year of publication as shown in Figure 3. The graph shows that the year with the most datasets created was 2019. The pie chart in Figure 4 shows that the most represented task is Clothes Parsing, which accounts for 32% of the datasets.
In total, 51 papers were selected: 17 for clothes parsing, 15 for product retrieval, 10 for clothes generation, 4 for clothes landmark detection, 2 for fashion classification and 8 for clothes recommendation. Note that the sum over the categories does not correspond to the number of selected papers, because two papers concern more than one category.
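The gap between the per-task counts and the number of selected papers can be checked with a short tally. This is only an arithmetic sketch over the figures quoted above; the variable names are illustrative.

```python
# Per-task counts as reported in the text.
counts = {
    "clothes parsing": 17,
    "product retrieval": 15,
    "clothes generation": 10,
    "clothes landmark detection": 4,
    "fashion classification": 2,
    "clothes recommendation": 8,
}
selected_papers = 51  # number of selected papers reported in the text

category_assignments = sum(counts.values())
# Assignments in excess of the paper count come from papers that are
# listed under more than one task.
extra_assignments = category_assignments - selected_papers

print(category_assignments, extra_assignments)
```

The tally gives more category assignments than papers, which is what one expects when some papers are counted under several tasks.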
To summarize all these datasets, Table 1 was built. Here, the datasets from 2012 to 2020 in chronological order are reported. This table is divided as follows: the first column represents the name of the dataset and the respective citation; the second column represents the year of publication; the third column shows the purpose for which the dataset was constructed or used; the fourth column presents the source on which the dataset was built.

A. DATASETS FOR OBJECT DETECTION
This section contains the tables that describe the different datasets used for the object detection task:
• Table 2 contains the datasets used for Clothes Landmark Detection. The name of the dataset, the citation, the year of publication, the number of images contained within the dataset and the number of landmark annotations are reported.
• Table 3 contains the datasets used for the Clothes Parsing task. The name of the dataset, the citation, the year of publication, the number of images contained within the dataset and the number of classes are reported.
• Table 4 contains the datasets used for the Product Retrieval task. The name of the dataset, the citation, the year of publication and the number of images contained within the dataset are reported.
Object detection is one of the best known and most common tasks in the world of deep learning. This section presents the datasets that have been used for this purpose in the fashion world. Object detection can be divided into 3 macro areas:
• Clothes Landmark Detection: The purpose of this area is to predict the locations of key points on clothes. These points, for example where the collar of a shirt ends or the cuff and hem, are of fundamental importance as they indicate the region of the outfit and delimit it. The goal is therefore to predict the locations of the K functional key points defined on the fashion items. Given an image I as input, the aim is to predict the positions L of the clothing landmarks, where L can be defined as $L = \{L_k\}_{k=1}^{K}$ and $L_k$ is the pixel position (u, v) of the k-th landmark in the input image. The datasets used for this purpose are listed in chronological order in Table 2.
• Clothes Parsing: Clothes Parsing is a special case of semantic segmentation in which the clothing items are the labels. It can be seen as the problem of labeling regions of an image. Given an image I that shows a person as input, the aim is to assign to each pixel the label of a clothing item, or null if the pixel belongs to the background. Assuming that regions of uniform appearance belong to the same item, the problem can be simplified and reduced to a prediction over a set of superpixels. The datasets used for this purpose are listed in chronological order in Table 3.
• Product Retrieval: Given an input image that contains fashion items, the purpose of image-based fashion retrieval is to find similar or identical items in an archive of shopping images from online sites. Table 4 summarizes the datasets used for the Product Retrieval task. These aspects are explored further in Section IV-A.
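The superpixel simplification mentioned for Clothes Parsing above can be sketched as a majority vote: each superpixel receives the label that most of its pixels carry, and that label is then copied back to every pixel of the region. The function name, the label strings and the toy image are assumptions for illustration and do not reproduce any specific method from the reviewed papers.

```python
from collections import Counter, defaultdict


def label_by_superpixel(pixel_labels, superpixels):
    """Give every pixel the majority label of the superpixel it belongs to.

    pixel_labels: 2-D list of per-pixel label guesses ("null" = background).
    superpixels:  2-D list of the same shape with the superpixel id of each pixel.
    """
    votes = defaultdict(Counter)
    for label_row, id_row in zip(pixel_labels, superpixels):
        for label, sp_id in zip(label_row, id_row):
            votes[sp_id][label] += 1
    # One label per superpixel: the most frequent one among its pixels.
    winner = {sp_id: counter.most_common(1)[0][0] for sp_id, counter in votes.items()}
    return [[winner[sp_id] for sp_id in id_row] for id_row in superpixels]


# Toy 2x4 image with two superpixels (left half = 0, right half = 1).
superpixels = [[0, 0, 1, 1],
               [0, 0, 1, 1]]
pixel_labels = [["shirt", "skirt", "null", "null"],   # one noisy "skirt" pixel
                ["shirt", "shirt", "null", "null"]]

parsed = label_by_superpixel(pixel_labels, superpixels)
print(parsed)
```

Predicting one label per region instead of one per pixel reduces the number of decisions and smooths out isolated errors, at the cost of depending on the quality of the superpixel segmentation.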
Some of the most important datasets used for object detection task will be described in detail.
• Fashionista (2012): The Fashionista dataset was introduced by Yamaguchi et al. in [8]. It consists of 158 235 photographs collected from the social networking website Chictopia.com. The authors identified 53 different clothing items and, adding labels for hair, skin and background, obtained a total of 56 possible clothing labels. The annotations were derived from tags, comments and links.
• Paper-Doll (2013): The Paper-Doll dataset was presented by Yamaguchi et al. in [10] and is an extension of Fashionista. This dataset contains 339 797 images,  [16]. This dataset is composed of 5 867 images taken from the Fashionista dataset [8], the Daily Photos dataset [9] and the CFD dataset [12], which show people in full-body view taken from the front or near-front, so that all parts of the body are visible. If an element is selected, it can belong to only one category. The total number of attributes extracted is over 3 000, and only those attributes that appear in more than 10 items have been retained, resulting in a list of 990 attributes.
• Fashion IQ (2019): Fashion IQ, introduced by Guo et al. in [48], contains fashion product images taken from a product review dataset [60]. The authors selected three categories of product items: Shirts, Dresses and Tops&Tees. For each image, they followed the link to the product website available in the dataset in order to extract the corresponding product information, when available. Leveraging the textual information on the website, the authors extracted attribute labels that contain fashion information. In particular, product attributes were extracted from the product title, the product summary and the detailed product description. In total, 1 000 attribute labels were extracted and further grouped into 5 attribute types: texture, fabric, shape, part and style.
• SIZER (2020): Tiwari et al. in [55] created SIZER, a dataset of clothing size variation of approximately 2 000 scans including 100 subjects wearing 10 garment classes in different sizes, where scans, clothing segmentation, SMPL+G registrations, body shape under clothing, garment class and size labels are available.
• UTFPR-SBD3 (2020): In [56], the authors constructed UTFPR-SBD3, intended for clothing segmentation in the context of soft biometrics. The dataset is composed of 4 500 images manually annotated into 18 classes plus an additional class for the background. Of these images, 1 003 come from the CCP dataset [11], 2 679 from the CFPD [12], and 685 from the Fashionista dataset [8]. Furthermore, 133 images were collected from the website Chictopia.com: they contain instances of the less frequent classes in the dataset, to ensure that each class has at least 100 instances.
• Fashionpedia (2020): A total of 50 527 images were collected from Flickr and other free-license photo websites. After filtering for images that contained fashion items, 48 825 images remained and were used to build the Fashionpedia dataset, collected by Jia et al. in [57]. In this dataset, each image is annotated with one or more fundamental garments. Furthermore, each fundamental garment is annotated with its garment parts.

B. DATASETS FOR FASHION CLASSIFICATION
The identification of clothing in an image is called fashion classification. Not many datasets have been created specifically for this task, but one of the most important and most used is certainly Fashion-MNIST. Created by Xiao et al. [32] in 2017, the Fashion-MNIST dataset is proposed as a more challenging replacement for the MNIST dataset, which consists of handwritten digits in 10 classes [61]. The images within this dataset come from the shopping website Zalando. To construct the dataset, the authors used thumbnail images of 70 000 unique products. The products span different gender groups: men, women, kids and neutral. White-colored products were not included in the dataset, since they have low contrast with the background. The manually labeled silhouette code of each product is used as its class label. Each product has only one silhouette code, for a total of 10 classes (0 = T-Shirt/Top, 1 = Trouser, 2 = Pullover, 3 = Dress, 4 = Coat, 5 = Sandals, 6 = Shirt, 7 = Sneaker, 8 = Bag, 9 = Ankle boots). Examples of images belonging to the dataset are shown in Figure 5.
Another dataset used for the classification task is CBL. Created by Liu et al. in [54], this dataset was built from 250 000 manually labeled images collected from 25 clothing brands; after the labeling phase, 57 000 images with clear logos were kept to form the CBL dataset, all of which contain brand and bounding box information.
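As a small illustration of how the Fashion-MNIST class indices listed above are typically used, the sketch below stores them in a lookup table and converts predicted indices into readable names. The class names are copied from the list in the text; the helper name `decode` is an assumption and is not part of any library.

```python
# Class indices and names exactly as listed in the text.
FASHION_MNIST_CLASSES = {
    0: "T-Shirt/Top",
    1: "Trouser",
    2: "Pullover",
    3: "Dress",
    4: "Coat",
    5: "Sandals",
    6: "Shirt",
    7: "Sneaker",
    8: "Bag",
    9: "Ankle boots",
}


def decode(label_ids):
    """Map integer class indices (e.g. classifier outputs) to class names."""
    return [FASHION_MNIST_CLASSES[i] for i in label_ids]


print(decode([9, 0, 5]))
```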

C. DATASETS FOR CLOTHES GENERATION
Given an image that contains a person, the aim of Clothes Generation is to re-dress that person in a different clothing style. This can be done by taking a realistic image containing fashion items and synthesizing it. In this section, the datasets used for this task are presented.
• UT-Zap50K (2014): The UT-Zap50K dataset, introduced by Yu and Grauman in [15], contains 50 025 shoe images from Zappos.com. There are 4 relative attributes (open, pointy, sporty and comfort), and for each attribute there are 3 000 annotated pairs. Shoe images were annotated using metadata such as the gender, the type, the materials and the manufacturer. They first extracted about 19 000 pairs of images of women and clothes, then removed the noisy images, resulting in 16 253 pairs. These pairs were then further subdivided into a training set of 14 224 pairs and a test set of 2 032 pairs. At test time, the person should wear a different piece of clothing from the original one, so the clothing images in the test pairs are randomly shuffled in the evaluation phase.
• Fashion-GEN (2018): The dataset created by Rostamzadeh et al. in [40] consists of 293 008 images. All fashion items are photographed from 1 to 6 different angles depending on the category of the item. Each product belongs to a main category and a more fine-grained category (i.e. a subcategory). There are 48 main categories and 121 fine-grained categories in the dataset.
• FashionTryOn (2019): In order to create their virtual try-on dataset, Zheng et al. crawled 4 327 clothing items from the shopping website Zalando.com, together with their corresponding model images. In a preprocessing phase, they removed the images considered noisy, that is, those that show only a part of the human body. Furthermore, they extracted the keypoints of each image using a pose estimator and removed all the images that had fewer than 10 keypoints. Finally, they created the FashionTryOn dataset, with a total of 28 714 triplets.
Each triplet consists of one image of a fashion item and two images of a person: the original image, in which the person is in a certain pose, and the target image, in which the person is in a different pose. In addition, they removed the videos and frames that were considered noisy. Each video contains mostly 250 to 300 frames. The videos were then divided into a training set, which contains 661 videos (159 170 frames), and a test set, which contains 130 videos (30 931 frames). Then, 791 images of people and 791 images of clothes were scanned, so that each video could be associated with a new image of a person or a new image of clothes. So, for the training of the network, triplets composed of one video, one picture of a person and one picture of clothes were considered.

D. DATASETS FOR CLOTHES RECOMMENDATION
With the growth of online shopping platforms and social networks, Clothes Recommendation systems have seen a huge increase. Systems of this kind can improve the user experience and bring great profit to shopping platforms. This type of service also aims to select and display a series of items that are available online and compatible with the choices already made and the items already viewed by the customer. Some datasets used for this purpose are listed below.
• Fashion Style14 (2017)  in [46] created the Clothes Recommendation Dataset by searching the websites of many clothing brands, including H&M, Forever21, Superdry, etc.; they crawled the clothing images and recorded the product information as labels at the same time. In this way, 127 824 images from 7 different brands were retained, and for every image the category, color, material, pair and price are specified.
• FashionAI (2019): Zou et al. [47] introduced FashionAI, a high-quality fashion dataset. It takes into account 6 categories of women's clothing and 41 subcategories. This dataset has a hierarchical structure: the categories can be considered the root nodes and the subcategories the leaf nodes, and each of these has both a dimension and a value. Therefore, the total number of annotations within the dataset is calculated not by adding all attribute values, but by taking the product of the numbers of attribute values in each attribute dimension. With this process, they obtained 24 different key points and 245 attribute values in 68 attribute dimensions. Specifically, they extracted millions of posts uploaded by users around the world, running automatic and manual filters on these posts. First, they detected the person's body and face using a pre-trained object detection model. Images that did not include the face or body, or in which the face and body were of abnormal size, were discarded. In this way they collected about 680 000 images.
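The difference between adding attribute values and multiplying them across attribute dimensions, as described for FashionAI above, can be seen on a toy example. The dimensions and values below are hypothetical and are not taken from the FashionAI annotation scheme.

```python
from math import prod

# Hypothetical attribute dimensions and their values, for illustration only.
dimensions = {
    "sleeve length": ["sleeveless", "short", "long"],
    "neckline": ["round", "v-neck"],
    "coat length": ["short", "mid", "long", "maxi"],
}

sizes = [len(values) for values in dimensions.values()]

additive_count = sum(sizes)    # counts each attribute value once
combined_count = prod(sizes)   # counts every combination of one value per dimension

print(additive_count, combined_count)
```

With only three small dimensions the product already exceeds the sum, which shows why a combinatorial annotation space grows much faster than the list of individual attribute values.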

IV. DEEP LEARNING METHODS
To understand and estimate trends affecting the world of fashion, several tasks must be addressed. These are summarized in Figure 6 and explained in detail in Section IV-A, Section IV-B, Section IV-C, Section V-A and Section V-B. Given the huge amount of data that has been collected within the datasets concerning fashion, only Deep Learning methods have been analyzed. In fact, these methods are the best performing in terms of both efficiency and time.
Figure 6. Fashion-related social media picture tasks. A score from 1 to 5 is given for the number of datasets and methods developed for a specific task. Furthermore, an example of the input image and of the output image after processing through deep learning methods is shown.

A. OBJECT DETECTION
In the field of computer vision, one of the main problems is object detection. As explained by Zou et al. in [62], object detection is an important problem which consists of identifying instances of objects within an image and classifying them as belonging to a certain class. Object detection also determines the location and size of each detected object.
The models for the object detection task can be divided into two macro categories: two-stage detectors and one-stage detectors. Models in the first category divide the task of identifying objects into several phases, following a "coarse-to-fine" policy. Models in the second category try to complete the detection in a single step using a single network. Some of the most famous methods for object detection are reported below:
• R-CNN (Two-stage-Detector). Region-based Convolutional Neural Networks (R-CNNs) follow a relatively simple process. They begin by extracting a set of object candidate boxes using selective search [63]. Then, for each proposal, a fixed-size image is cropped and fed to a trained CNN to extract its features. Finally, linear SVM classifiers are used to decide whether an object is present in each region and to recognize the categories of the objects found. With selective search, the areas to be proposed are generated hierarchically, so as to capture objects in various poses and sizes. R-CNN models have an important drawback: the large number of features to be classified, resulting from the overlap of the many proposed areas (more than 2,000 areas for each image). This forces the model to process a large amount of data, compromising its performance. R-CNN is used in the work of Lao and Jagadeesh [64] for the clothing detection task on the Colorful Fashion dataset [12].
• SPP-Net (Two-stage-Detector). In 2015, He et al. proposed Spatial Pyramid Pooling Networks (SPP-Net) [65]. Before SPP-Net, CNN models required fixed-size input. This entailed a loss of accuracy when detecting images whose sizes and proportions differed from the expected ones: such images had to be resized, with possible loss of detail or deformation, or even cropped. The reason is that the last levels of a CNN are fully connected layers, which work on an input of a predetermined size. The innovation of SPP-Net lies precisely in introducing, between the convolutional levels and the fully connected ones, a Spatial Pyramid Pooling level, which aggregates the features produced by the convolutional levels and returns an output of a predetermined size. When SPP-Net is used for object detection, the set of proposed areas is calculated starting from the features extracted by the first convolutional levels. Subsequently, representations of predetermined dimensions are generated for the proposed regions. These representations can then be fed to the classifiers. Dong et al. [66] combine the VGG-19 network with SPP-Net to recognize the style of clothing images.
• Fast R-CNN (Two-stage-Detector). Fast R-CNN, proposed by Girshick [67], allows a classifier and a bounding-box regressor to be trained simultaneously within a single model. The loss function of Fast R-CNN takes into account the error made in each phase of forward propagation. Fast R-CNN takes as input an image and a set of object proposals, i.e. areas that are supposed to contain objects within the image; the network processes the image through a succession of convolutional and max pooling layers to extract its features and produce a convolutional feature map. At this point, a pooling layer called the RoI (Region of Interest) Pooling Layer extracts a vector of a predetermined size from the newly obtained map and processes it through two fully connected layers. Each vector thus obtained propagates along two branches, both of which consist of a series of fully connected levels. In the first branch, the output passes through a softmax level that estimates, for each of the K classes of recognizable objects, the probability that the area contains an object of the k-th class. In the second branch, the levels produce four real numbers for each of the K classes of objects. These values encode, for each recognizable class, the center and the dimensions of the corresponding bounding box.
The method in [71] produces a total of k² position-sensitive score maps with a fixed grid of k × k. Then, a position-sensitive RoI pooling layer is added to join the responses from these score maps. Finally, in each RoI, the k² position-sensitive scores are averaged to produce a vector, and softmax responses across categories are calculated. Another convolutional layer is added to obtain class-independent bounding boxes.
• FPN (Two-stage-Detector). Before Feature Pyramid Networks (FPNs), created by Lin et al. in 2017 [72], most deep learning-based detectors performed recognition only on features obtained from the last layers of the network. In FPN, the starting point is the feature map with the lowest resolution and the highest semantic value, i.e. the one obtained in the last convolutional level (the tip of the pyramid). This map is combined with the feature map of the next level of the pyramid, which has a higher resolution but less semantic value. The procedure continues until the feature map that forms the base of the pyramid is reached. Predictions are made on each of the feature maps thus obtained. The result is a pyramid of features, rich in semantics at every level and quickly built from an image without the need to resize it. For example, Martinsson and Mogren [73] proposed a fully convolutional neural network based on feature pyramid networks to approach the problem of semantically segmenting fashion images into different categories of clothing.
• YOLO (One-stage-Detector). The YOLO network [74] was proposed by J. Redmon. The work in [80] used SSD to construct an extended version, called CDSSD, that facilitates unsupervised training of the underlying network architecture, with the aim of extracting fashion trends from social media.
As already mentioned in section III-A, in the world of fashion object detection refers to three different tasks: Clothes Landmark Detection, Clothes Parsing and Product Retrieval.
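The fixed-size output that SPP-Net obtains from inputs of any size can be illustrated with a small NumPy sketch. It is only an illustration under stated assumptions: the pyramid levels (1, 2, 4), the use of max pooling in each cell and the function name `spatial_pyramid_pool` are choices made here, not details taken from [65].

```python
import numpy as np

def spatial_pyramid_pool(feature_map, levels=(1, 2, 4)):
    """Max-pool a (C, H, W) feature map over n x n grids and concatenate.

    The output length is C * sum(n * n for n in levels) whatever H and W are,
    provided that H and W are at least max(levels).
    """
    channels, height, width = feature_map.shape
    pooled = []
    for n in levels:
        # Bin edges that split height and width into n roughly equal parts.
        row_edges = np.linspace(0, height, n + 1).astype(int)
        col_edges = np.linspace(0, width, n + 1).astype(int)
        for i in range(n):
            for j in range(n):
                cell = feature_map[:, row_edges[i]:row_edges[i + 1],
                                   col_edges[j]:col_edges[j + 1]]
                # One maximum per channel for this grid cell.
                pooled.append(cell.max(axis=(1, 2)))
    return np.concatenate(pooled)
```

Two feature maps with different spatial sizes, for example (8, 13, 17) and (8, 32, 20), both yield a vector of length 8 × (1 + 4 + 16) = 168, which is what allows the fully connected layers to follow.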

1) CLOTHES LANDMARK DETECTION
Detecting fashion landmarks from an image is a fundamental and practical task, whose goal is to predict the location of useful keypoints defined on fashion items, such as the corners of the neckline, hemline, and cuff.
Extensive research has been devoted to detecting fashion landmarks and has achieved excellent performance.
The first to introduce neural networks for this task were Liu et al. [20]. They treat Clothes Landmark Detection as a regression task and created the FashionNet network for direct regression of landmark coordinates. Liu et al. [21] design pseudo-labels to improve the invariance of fashion landmarks. Yan et al. [25] combine selective dilated convolution and recurrent spatial transformers to detect clothing landmarks in unconstrained scenes. In all the above methods, each landmark point is estimated separately, so the detected landmarks may be ambiguous or structurally inconsistent. Inspired by the attention mechanism, Wang et al. [81] proposed an attentive grammar network that uses high-level human knowledge to predict landmark positions globally. At the same time, they point out that fashion landmark regression is a highly non-linear problem that is very difficult to learn directly. Therefore, they learn to predict a confidence map of the position distribution for each landmark. Chen et al. [82] also adopted this approach for fashion landmark recognition: they proposed a Clothes Landmark Detection network based on the Feature Pyramid Network and designed the Dual Attention Feature Enhancement (DAFE) module to improve the feature representations while recovering the size of the feature maps. Li et al. [83], inspired by the visual attention mechanism [84] and the non-local block [85], proposed the Spatial-Aware Non-Local (SANL) block, which encodes prior knowledge by taking into account a spatial attention map. The more recent method proposed by Yu et al. [86] defines a complex fashion layout-graph and models the structural layout relationships among landmarks. However, it propagates the information according to a fixed layout-graph and cannot deal with diverse deformations or occlusions. Recently, Chen et al. [87] proposed a novel framework, called Adaptive Graph Reasoning Network (AGRNet), for Clothes Landmark Detection. It introduces graph-based reasoning to adaptively impose structural layout constraints among landmarks on the deep representations. The best results on both the FLD Dataset and the DeepFashion Dataset are provided by Kai et al. [88] with their MDDNet network: MDDNet achieves the best average NE score, 0.0267 on the FLD Dataset and 0.0251 on the DeepFashion Dataset, compared with other fashion networks.
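The confidence-map formulation mentioned above can be sketched as follows: one map per landmark is predicted, and the landmark position is read off as the location of the peak. The Gaussian target and its `sigma` value are assumptions made for this illustration; the cited works may build and decode their maps differently.

```python
import numpy as np

def decode_landmarks(confidence_maps):
    """Return a (K, 2) array of (x, y) peaks from (K, H, W) confidence maps."""
    num_landmarks, height, width = confidence_maps.shape
    # Position of the maximum of each flattened map.
    flat_index = confidence_maps.reshape(num_landmarks, -1).argmax(axis=1)
    ys, xs = np.unravel_index(flat_index, (height, width))
    return np.stack([xs, ys], axis=1)

def gaussian_map(center_xy, height, width, sigma=2.0):
    """Hypothetical training target: a 2D Gaussian centred on one landmark."""
    xs = np.arange(width)[None, :]
    ys = np.arange(height)[:, None]
    cx, cy = center_xy
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))
```

Decoding two Gaussian maps centred at (5, 9) and (20, 3) returns exactly those two positions, which shows how a dense map replaces direct coordinate regression.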
The latest works developed for this task are those of Kim et al. [89] and Song et al. [90]. The first is a method based on a one-stage detector that aims to reduce the high computational costs required by large-scale datasets. This network, an adaptation of EfficientDet, developed by Google Brain, performs two tasks very quickly: detecting multiple clothes within the image and identifying landmarks. With this adaptation, the authors achieved 0.686 mAP in bounding box detection and 0.450 mAP in landmark identification, with an inference time of 42 ms on a single GPU. In this way the authors addressed the trade-off between accuracy and speed that arises with large datasets.
The second work was proposed to address the occlusions that can be found in the images. In particular, the authors developed a new loss function, called Position Constraint Loss (PCLoss), which uses the positional relationships between the landmarks to identify the wrong ones and corrects their positions through a regularization term for each landmark.
The results of all the methods are reported in Table 5 and Table 6. Table 5 presents the performance of state-of-the-art methods for Clothes Landmark Detection on the FLD [21] Dataset in terms of normalized error (NE). Table 6 shows the performance of state-of-the-art methods for Clothes Landmark Detection on the DeepFashion [20] Dataset in terms of normalized error (NE).
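As an illustration of the metric used in these tables, the sketch below computes a normalized error under the common assumption that NE is the mean L2 distance between predicted and ground-truth landmarks after dividing x by the image width and y by the image height. The optional `visible` mask and the function name are hypothetical; the exact protocol of each benchmark may differ.

```python
import numpy as np

def normalized_error(pred_xy, true_xy, image_size, visible=None):
    """Mean L2 landmark distance in normalized image coordinates."""
    width, height = image_size
    scale = np.array([width, height], dtype=float)
    diff = (np.asarray(pred_xy, dtype=float) - np.asarray(true_xy, dtype=float)) / scale
    dist = np.linalg.norm(diff, axis=-1)
    if visible is not None:
        # Keep only the landmarks flagged as visible.
        dist = dist[np.asarray(visible, dtype=bool)]
    return float(dist.mean())
```

For a 100 × 100 image, a landmark that is off by (3, 4) pixels contributes a normalized distance of 0.05, so the values around 0.025 reported above correspond to errors of a few percent of the image size.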

2) CLOTHES PARSING
The purpose of object parsing is to understand the contents of an image in detail by segmenting the image into regions with different semantic meanings. In particular, fashion parsing and human parsing with clothing classes aim to find the significant regions within images that contain people wearing certain clothes. As in semantic segmentation, object and label diversity remains an open challenge for human parsing. Unlike classic semantic segmentation tasks, such as scene parsing, human parsing must both identify the different parts of the person in the input image and assign the right label to each garment that the person wears. Human parsing methods must also withstand large variations in occlusion, pose, lighting and viewpoint. Therefore, it is not advisable to apply semantic segmentation frameworks directly to human parsing. Moreover, hand-crafted algorithms for human parsing are not very powerful, as they are neither robust nor flexible in adaptation.
Yamaguchi et al. [8] were the first to tackle the task of Clothes Parsing. In their work, they considered the problems of clothing parsing and pose estimation, and refined them by considering the relationship between the two. However, this method mainly addressed a setting in which the analyzed images had been labeled with user-provided tags indicating the depicted articles of clothing. To overcome this limitation, Yamaguchi et al. [10] proposed garment parsing using a retrieval-based approach. Given an input image, the first step was to retrieve similar images from a dataset; the second step was to find the closest parsings, which were then applied to the final result via dense matching.
Liu et al. [92] proposed a quasi-parametric parsing framework called Matching Convolutional Neural Network (M-CNN), which is able to fully utilise the supervision information from the annotated training data and, at the same time, to extend it to newly added labels.
Inspired by the performance achieved in traditional classification and recognition tasks, Liang et al. [16] used a Deep Convolutional Neural Network to construct an end-to-end relationship between the input, which consists of a human image, and the outputs.
Liang et al. [93] solved this problem by incorporating LG-LSTM layers into CNNs instead of learning features only from local convolution kernels, as in [17]: this was done to take into account both long- and short-distance spatial dependencies. Moreover, by adopting hidden and memory cells in LSTMs, it is possible to store the previous contextual interactions from neighboring locations and from the complete image in previous LG-LSTM layers. Furthermore, they introduced the Graph LSTM structure to capture long-distance dependencies on the superpixels. As can be seen from Tables 7 and 8, the latter method is the best in accuracy and average F1-score. Table 7 reports the performance of state-of-the-art methods for Clothes Parsing on the Fashionista Dataset in terms of Accuracy, F.G.Accuracy, Average Precision, Average Recall and Average F1-Score in percentage. Table 8 reports the performance of state-of-the-art methods for Clothes Parsing on the ATR Dataset in terms of the same metrics.
Ye et al. [94] introduced a new network called FinerNet, which first segments the human foreground region. The following stage then takes as input the original image and the results of the previous stage to assign finer labels to each pixel. Moreover, by effectively using human posture features, the network can achieve better segmentation results. In fact, FinerNet performs better than the other state-of-the-art methods in F.G.Accuracy, Average Precision and Average Recall on the ATR dataset.
The results of the state-of-the-art methods for Clothes Parsing are reported on five different datasets: Fashionista, ATR, CCP, CFPD and CIHP. Table 9 presents the performance of state-of-the-art methods for Clothes Parsing on the CFPD Dataset in terms of Accuracy and IoU in percentage. Table 10 shows the performance of state-of-the-art methods for Clothes Parsing on the CCP Dataset in terms of Accuracy, F.G.Accuracy, Average Precision, Average Recall and Average F1-Score in percentage. Table 11 reports the performance of state-of-the-art methods for Clothes Parsing on the CIHP Dataset in terms of Pixel Accuracy, Average Accuracy and Average IoU in percentage.
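To make the parsing metrics concrete, the following sketch computes pixel accuracy, foreground (F.G.) accuracy and mean IoU from two integer label maps. It assumes that class 0 is the background and that the mean IoU is averaged over the classes present in the prediction or in the ground truth; individual benchmarks may use slightly different conventions.

```python
import numpy as np

def parsing_metrics(pred, true, num_classes, background=0):
    """Return (pixel accuracy, foreground accuracy, mean IoU) for label maps."""
    pred = np.asarray(pred).ravel()
    true = np.asarray(true).ravel()
    # Confusion matrix: rows are ground-truth classes, columns are predictions.
    confusion = np.bincount(true * num_classes + pred,
                            minlength=num_classes ** 2)
    confusion = confusion.reshape(num_classes, num_classes)
    pixel_accuracy = np.trace(confusion) / confusion.sum()
    # Accuracy restricted to pixels whose ground truth is not background.
    foreground = true != background
    fg_accuracy = (pred[foreground] == true[foreground]).mean()
    intersection = np.diag(confusion)
    union = confusion.sum(axis=0) + confusion.sum(axis=1) - intersection
    present = union > 0
    mean_iou = (intersection[present] / union[present]).mean()
    return float(pixel_accuracy), float(fg_accuracy), float(mean_iou)
```

The foreground accuracy is usually lower than the pixel accuracy on fashion images, because the large and easy background region no longer contributes to the score.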

3) PRODUCT RETRIEVAL
Given the rapid development of e-commerce sites, which has resulted in an increase in online shopping, many studies have dealt with the task of product retrieval based on images or videos. This type of study makes consumers and the computer interact: given an input image, the consumer can provide additional information on the desired attributes. Although this is a very recent task (the first work is that of Wang et al. [115] in 2019), many systems have already been developed.
Wang et al. [115] started from the sketch-based image retrieval (SBIR) method and developed it by adding a re-ranking approach based on multi-clustering. Furthermore, they propose an unsupervised method using blind feedback, in order to make the re-ranking approach adaptive to different types of image datasets and invisible to users.
The paper by Peng and Chi [116] uses the Domain Adaptation with Scene Graph (DASG) approach, whose purpose is to transfer knowledge from the source domain to improve cross-media retrieval in the target domain.
In the study conducted by Nie et al. [117], the authors proposed an end-to-end deep hashing method called deep multiscale fusion hashing (DMFH) to perform the cross-modal retrieval task. In particular, they built different network branches for two modalities and then used multiscale fusion models for each branch: this was done to merge multiscale semantics, which can then be used to explore semantic relevance.
A novel deep hashing method, proposed by Wang et al. [118], is based on pairwise similarity-preserving quantization constraint, referred to as Deep Semantic Reconstruction Hashing (DSRH). In order to learn compact binary codes, they developed a high-level semantic affinity within each data pair.
Some works, such as [18], [20], [119]-[121], improve the performance of product retrieval in the fashion world by including supplementary semantic information. Other works, such as [122]-[125], concentrate on training a fashion retrieval model with specially designed losses. There have also been works that have tried to optimize the feature representation instead, such as [122], [126], [127]. The work developed by Ji et al. [128] employed attention mechanisms in Fashion Retrieval, focusing on some significant regions of the image.
FashionNet, proposed by Liu et al. [20], includes attribute and landmark information to perform this type of task.
The method proposed by Su et al. [129] is the best among the other methods mentioned before: the novelty is that it integrates the attribute and landmark information with a bilinear attention pooling module.
Three recent works have been developed in the field of fashion retrieval. The first is that of Sharma et al. [130]. The difference from the previous methods is that they used two different datasets: the source dataset and the target dataset. In this way, they built a cross-domain retrieval model, trained on the source dataset and tested on a new unlabeled dataset. Thus the entire model is unsupervised.
D'Innocente et al. [131] instead proposed a method that takes as input an image and the positions of the points of interest that identify the required attributes within the image. To achieve this representation, points of interest are mapped into a coordinate system using bilinear interpolation. The generated feature map is then passed through a convolutional layer. The model is driven by a loss function called localized triplet loss [132], which searches for similar images by considering the similarity between corresponding points of interest. Similarly, Dong et al. [133] proposed a network made up of two branches: the first branch takes the whole image as input, while the second takes as input only the part of the image that is of interest. This crop is obtained using a specific localization method. The joint network was called Attribute-Specific Embedding Network (ASEN). Tables 12 and 13 show the results of state-of-the-art methods with respect to Top-20 Accuracy. Given a query image, the gallery images are ranked in ascending order of their Euclidean distance to the query. If the ground-truth gallery image is found among the first 20 positions of this ranking, the retrieval is considered a success. Table 12 is more detailed, as it shows the top-20 accuracy per attribute, i.e. Dress, Leggings, Outwear, Pants, Skirts and Tops. In both cases the best performances are those of the AHBN method [129]. Table 13 presents the performance of state-of-the-art methods for Product Retrieval on DeepFashion with the Consumer To Shop Benchmark (Table (a)) and with the In-Shop Benchmark (Table (b)).
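The top-20 accuracy described above can be sketched in a few lines. The function and argument names are hypothetical, and the sketch assumes that each query and gallery image is represented by a feature vector and an item identifier; the rate is computed as the fraction of queries whose ground-truth item appears among the k nearest gallery images.

```python
import numpy as np

def top_k_accuracy(query_feats, gallery_feats, query_ids, gallery_ids, k=20):
    """Fraction of queries whose item id appears among the k nearest gallery items."""
    query_feats = np.asarray(query_feats, dtype=float)
    gallery_feats = np.asarray(gallery_feats, dtype=float)
    gallery_ids = np.asarray(gallery_ids)
    # Pairwise Euclidean distances, shape (num_queries, num_gallery).
    dists = np.linalg.norm(query_feats[:, None, :] - gallery_feats[None, :, :],
                           axis=2)
    # Indices of the k closest gallery items per query, in ascending distance.
    nearest = np.argsort(dists, axis=1)[:, :k]
    hits = (gallery_ids[nearest] == np.asarray(query_ids)[:, None]).any(axis=1)
    return float(hits.mean())
```

With a gallery smaller than k, every query with a matching item counts as a success, so k should be chosen well below the gallery size for the metric to be informative.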

B. FASHION CLASSIFICATION
ML and DL techniques bring great benefits to image recognition and classification in the fashion environment. In fact, they can help to improve the user experience [147], which is a fundamental factor for the calculation of the Key Performance Indicator (KPI), which can be measured through factors such as the time spent by the user in front of the computer, the purchase volume and the average checkout value.
Deep Learning methods, and in particular Convolutional Neural Networks, can give the user a more pleasant experience on the site by making the search for products quicker and more convenient. As a consequence, there will be an increase in KPIs, in business profits and in the efficiency of the product management system. A multi-brand online store has to group the products and establish the rules necessary for unification and quality standards. When a brand shop proposes some products to the multi-brand online store, the manager reviews the incoming products and decides whether to approve or reject them. This methodology encounters two problems. The first is that paying the individual people who carry out this supervision process is very expensive. The second is that human reviews of different products in different stores take a very long time. One way to reduce costs and times, and consequently increase the performance and quality of the results, is to use automatic systems based on CNNs.
Considering the importance of clothing in society, there are many applications for Fashion Classification. An example is the prediction of clothing details in an image, that can help find similar clothing items in a dataset from e-commerce sites. Analogously, Fashion Classification based on user preferences can be used to provide recommendations to the user.
Some problems and issues must be considered in Fashion Classification to make these applications effective. In particular, the difficulties caused by the properties of clothing must be considered [148]-[151]: 1) The same clothing can be considered different depending on the point of view, and different clothing can be considered the same (the lower part of a particularly short dress can be classified as a skirt); 2) Clothing can be easily deformed by stretching or folding; 3) A picture of clothing can change; for example, the images may contain only the type of fabric, or models wearing a dress made of that same fabric; 4) The images can be very different from each other, in the sense that they can be taken under many different conditions, including different angles and lighting, cluttered backgrounds, and partial occlusion by other objects or people; 5) Some classes of clothing have almost identical features and can be confused with each other. For example, the pants and tights classes are very similar to each other and very difficult to distinguish; 6) Some clothing classes are very difficult to identify. For example, this may be due to their small size, as with accessories. Therefore, algorithms that achieve high classification performance for multi-class fashion data are needed. For these reasons, DL methods, and especially CNNs, are the most commonly used approaches for this task.

1) CONVOLUTIONAL NEURAL NETWORKS FOR CLASSIFICATION TASK
The first CNN architecture to become widely known is AlexNet [152]. This network classified the images of the ImageNet dataset, which contains 15 million images and 22,000 categories. AlexNet has a total of 8 layers: 5 convolutional layers, some of them followed by dropout layers or max-pooling layers, and 3 fully connected layers. The Rectified Linear Unit function (ReLU) is used as the activation function, which improved speed over the hyperbolic tangent (tanh) and sigmoid functions. Data augmentation is performed using patch extraction, image translation and horizontal mirroring. The best result of this network is a top-5 test error rate of 16.4%.
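The top-5 error rates quoted in this section count a prediction as correct when the true class is among the five highest-scoring classes. A minimal sketch of this computation, with a hypothetical function name and assuming a matrix of per-class scores, is given below.

```python
import numpy as np

def top_k_error(scores, labels, k=5):
    """Fraction of samples whose true label is not among the k best scores."""
    scores = np.asarray(scores)
    labels = np.asarray(labels)
    # Indices of the k highest-scoring classes per sample (unordered within k).
    top_k = np.argpartition(-scores, k - 1, axis=1)[:, :k]
    correct = (top_k == labels[:, None]).any(axis=1)
    return float(1.0 - correct.mean())
```

Setting `k=1` gives the ordinary classification error, which is always at least as large as the top-5 error on the same predictions.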
Another network developed and used for the classification task is ZFNet [153]. This network improves on the results of AlexNet by reaching a top-5 test error rate of 11.7%. There is one main difference between ZFNet and AlexNet: AlexNet uses filters of size 11 × 11 in the first layer, whereas ZFNet uses a smaller filter size (7 × 7) with a reduced stride. This choice was made in order to keep much more information about the input volume inside the layers. Another contribution of ZFNet is the possibility of visualizing the weights and filters of the architecture: using a deconvolutional network, the authors developed a new visualization technique that allows dissimilar feature activations and their relationship with the input to be examined. Like AlexNet, this network uses ReLU as the activation function and categorical cross-entropy as the loss function.
In 2014, the VGG network was introduced by Simonyan and Zisserman [154], with a 7.31% error rate. This network is composed of 16 (VGG16) or 19 (VGG19) convolutional and fully connected layers: the filters have size 3 × 3 and the pooling layers have size 2 × 2. Using a smaller filter many times is convenient, as it reduces the number of parameters without decreasing performance. Moreover, after each pooling layer, the number of filters is doubled. In this way the spatial dimension decreases, while the depth increases. Data augmentation is performed in this network through scale jittering, and the ReLU function is used as the activation function.
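The parameter saving obtained by stacking small filters can be checked with simple arithmetic. The sketch below compares three stacked 3 × 3 convolutions with a single 7 × 7 convolution, which cover the same 7 × 7 receptive field at stride 1; the channel count of 256 is a hypothetical value and biases are ignored.

```python
def conv_weights(kernel_size, in_channels, out_channels):
    """Number of weights of a square convolution, ignoring biases."""
    return kernel_size * kernel_size * in_channels * out_channels

def stacked_receptive_field(kernel_size, num_layers):
    """Receptive field of num_layers stacked stride-1 convolutions."""
    # Each extra layer widens the receptive field by kernel_size - 1.
    return 1 + num_layers * (kernel_size - 1)

channels = 256  # hypothetical number of input and output channels
three_3x3 = 3 * conv_weights(3, channels, channels)   # 27 * channels**2
one_7x7 = conv_weights(7, channels, channels)         # 49 * channels**2
```

Under these assumptions the three 3 × 3 layers need 27C² weights against 49C² for the single 7 × 7 layer, roughly 45% fewer, while also inserting two additional non-linearities.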
To improve the performance of VGGNet, Szegedy et al. [155] built GoogLeNet, with an error rate of 6.7%. This network introduced for the first time the concept of the Inception Module, i.e. a parallel layer structure. This module consists of parallel connections with filters of different sizes (1 × 1, 3 × 3, 5 × 5): filters of different sizes are used in order to process the input image at different scales. Max pooling layers, with a size of 3 × 3, are merged with the parallel connections. The outputs of these layers are then concatenated to form the module output. Furthermore, a bottleneck layer with size 1 × 1 is applied: this was done to decrease the number of channels and the number of weights for each filter. Overall, GoogLeNet is composed of a total of 22 layers: 3 convolutional layers at the beginning, then 9 inception modules, each two layers deep, and 1 fully connected layer.
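The effect of the 1 × 1 bottleneck on the number of weights can be illustrated with a small calculation. The channel counts used here (256 input channels, 32 bottleneck channels, 64 output channels) are hypothetical values chosen for the example, not the actual GoogLeNet configuration, and biases are ignored.

```python
def conv_weights(kernel_size, in_channels, out_channels):
    """Number of weights of a square convolution, ignoring biases."""
    return kernel_size * kernel_size * in_channels * out_channels

# Hypothetical branch: 256 input channels, 64 output channels, 5 x 5 filters.
direct = conv_weights(5, 256, 64)

# Same branch with a 1 x 1 bottleneck that first reduces 256 channels to 32.
with_bottleneck = conv_weights(1, 256, 32) + conv_weights(5, 32, 64)
```

With these numbers the branch drops from 409,600 to 59,392 weights, about a sevenfold reduction, which is what makes several parallel filter sizes affordable within one module.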
One of the most important networks for the classification task is ResNet, created by He et al. [156]. It reached an error rate of 3.57%. This network has 152 layers and, thanks to residual learning, it can use much deeper layers without degrading the output.
In addition, several methods with different configurations have been developed. Furthermore, the networks can be integrated with other methods; the outcome is a hybrid method.
The most recent work in the field of classification is the one conducted by Kuang et al. [157]. The authors built a hierarchical system that classifies clothes within large datasets. In particular, they performed the classification task using multiple deep CNN branches. In addition, to improve performance, they added a hierarchical method, and finally applied it to the recommendation task.
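Residual learning can be summarized as computing y = ReLU(F(x) + x), so that a block only has to learn a correction F(x) to its input. The sketch below uses two dense layers as a stand-in for the convolutions of a real residual block; this simplification and the function names are assumptions made for illustration.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """y = relu(F(x) + x), with F a two-layer transform.

    Dense layers stand in for the convolutions of an actual ResNet block;
    w1 and w2 must keep the feature dimension of x unchanged overall.
    """
    residual = relu(x @ w1) @ w2
    # Skip connection: the input is added back before the final activation.
    return relu(residual + x)
```

When both weight matrices are zero, the block returns its non-negative input unchanged, which illustrates why adding more such blocks does not have to degrade the output of a shallower network.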

2) PERFORMANCE COMPARISON ON FASHION-MNIST DATASET
Many CNN architectures, such as LeNet [61], AlexNet [158], GoogLeNet [155], VGGNet [154] and ResNet [156], some of which were already mentioned in the previous paragraph, have been used in image classification.
In addition to these, other types of work have been carried out for this task. Convolutional Neural Networks are trained to classify the images of the Fashion-MNIST dataset. McKenna [159] compared sigmoid, ELU and ReLU activation functions in order to supply benchmarks that were missing for the Fashion-MNIST dataset. First, multilayer non-convolutional feed-forward neural networks, which were missing for Fashion-MNIST, were evaluated as a benchmark. Then, the effectiveness of the contemporary activation functions (ELU, ReLU and sigmoid) was tested.
Xiao et al. [32] created a dataset in which the Fashion-MNIST images are converted into the format of the MNIST dataset, which makes them easier to work with. Greeshma and Sreekumar [160] suggested a system to classify the fashion items in the Fashion-MNIST dataset using HOG features and a multi-class SVM as the classifier. Table 14 shows the performance comparison between some state-of-the-art methods for Fashion Classification on the Fashion-MNIST Dataset. The best performance on this dataset is provided by the LeNet-5 network [161]. This network has the following architecture: 3 convolutional layers, 2 subsampling layers and 2 fully connected layers. The first convolutional layer takes as input a 32 × 32 grayscale image and applies six 5 × 5 convolutional filters with a stride of 1. Its output is passed through a subsampling (pooling) layer that applies a 2 × 2 filter with a stride of 2. Convolutional and subsampling layers then alternate up to the last convolutional layer. A fully connected layer is added, and the last layer is connected to a softmax that gives the image classification output. This network reaches an accuracy of 98.80%.
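The spatial sizes inside LeNet-5 can be traced with the usual output-size formula for convolution and pooling, floor((n + 2p - k) / s) + 1. The layer list below assumes the classical LeNet-5 layout (5 × 5 convolutions with stride 1, 2 × 2 subsampling with stride 2, no padding); the kernel sizes of the later layers are an assumption, since only the first layers are specified in the text.

```python
def output_size(input_size, kernel_size, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer."""
    return (input_size + 2 * padding - kernel_size) // stride + 1

# (layer type, kernel size, stride), assuming the classical LeNet-5 layout.
LENET5_LAYERS = [("conv", 5, 1), ("pool", 2, 2),
                 ("conv", 5, 1), ("pool", 2, 2),
                 ("conv", 5, 1)]

def trace_sizes(input_size, layers):
    """Return the spatial size after each layer, starting with the input."""
    sizes = [input_size]
    for _, kernel_size, stride in layers:
        sizes.append(output_size(sizes[-1], kernel_size, stride))
    return sizes
```

For a 32 × 32 input, `trace_sizes(32, LENET5_LAYERS)` gives 32, 28, 14, 10, 5 and 1, so the last convolutional layer already produces a 1 × 1 map per filter that feeds the fully connected layers.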

C. CLOTHES GENERATION WITH GANs
In recent years, one of the most developed topics in the Deep Learning world has been Generative Adversarial Networks (GANs), created by Goodfellow in 2014 [162]. Their importance is due to the fact that they have proven to be excellent in many areas, especially in image generation [163], [164] and in image processing [165], [166]. They have also shown interesting results in generating novel images, e.g. faces [164], indoor scenes [167] and fine-grained objects [24], [168].
For the purposes of this paper, the discussion focuses on clothes generation with GANs [41], [169], [170].
The task consists in taking two images as input: the first contains the clothes to try on, while the other contains a person in a certain pose who is already wearing clothes. This is a very difficult task, especially because the pose of the person can cause various problems: for example, some parts of the body may be occluded or not visible in that pose. Existing methods perform this task using three different networks, each of which is used for a specific purpose. The first network carries out an affine transformation to align the desired clothes with the person; the second network dresses the person; the last network carries out a post-processing phase to make the final image as realistic as possible.
In this context, Yoo et al. [24] generated the image of a dressed person from a clothing image and vice versa, without considering the person's pose.
Another work, proposed by Lassner et al. [139], is based on a generative network that produces images of dressed people.
FashionGAN, created by Zhu et al. [169], uses textual descriptions to succeed in replacing a dress that a person is already wearing with another.
The first to use a generative network for the task of text-to-image synthesis, capable of generating low-resolution images, were Reed et al. [168]. StackGAN++ [171] aims to produce images that are as realistic as possible, using a tree-like structure with multiple discriminators and generators. Another model, with a structure similar to StackGAN++, is AttnGAN [172]. It is composed of additional attention modules and a deep similarity model. Emir et al. developed a new adversarial network called e-AttnGAN [173], which includes an integrated attention module that incorporates word and sentence context features during the image generation process. This image generation process is carried out by Feature-wise Linear Modulation (FiLM) layers [174], which can control the visual features without using a supervised approach.
Given the successes of GANs, conditional GANs were developed in [175], [176]: they incorporate a specific conditional constraint, so that the network acquires knowledge from the specific condition it is given in order to produce realistic fake images. This type of GAN includes a binary mask as conditional input: the mask is concatenated with the input image, as in the work proposed by Ma et al. [177], or with a latent noise vector, as in the work developed by Park et al. [178].
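A FiLM layer applies a per-channel affine transformation, FiLM(x) = γ · x + β, whose parameters are produced from the conditioning input (here, text features). The sketch below shows this operation in NumPy; the linear generators in `film_parameters` and all shapes are hypothetical choices for illustration and do not reproduce the e-AttnGAN architecture.

```python
import numpy as np

def film(features, gamma, beta):
    """Feature-wise linear modulation of a (C, H, W) feature map.

    gamma and beta have shape (C,): one scale and one shift per channel.
    """
    return gamma[:, None, None] * features + beta[:, None, None]

def film_parameters(text_embedding, w_gamma, w_beta):
    """Hypothetical linear generators mapping a text embedding to (gamma, beta)."""
    return text_embedding @ w_gamma, text_embedding @ w_beta
```

Because γ and β depend only on the conditioning vector, the same sentence rescales or suppresses whole feature channels uniformly across the image.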
Within the fashion environment, generative networks have been used to exchange the clothes of a person in a source image with other clothes from a target image. This task is known as Virtual Try On Networks (VTON) [41], [179], [180]. These methods generally share the same structure, which consists of three main components executed by different networks:
• a network that aligns the target cloth to the source image by learning an affine transformation [181];
• a network that performs stitching and swapping, which is usually a GAN;
• a refinement network, which changes from method to method.
Pandey and Savakis [182] introduced the Poly-GAN network, which builds the output from more than two inputs and can be used to perform different tasks. This is a very important network, as it is the first example of a single network that performs all three tasks described above.
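The alignment step can be illustrated with a 2 × 3 affine matrix applied to garment keypoints. In the cited methods the transformation is learned by a network; the sketch below instead fits it by least squares on hypothetical point correspondences, a simpler technique swapped in here only to show what an affine alignment computes.

```python
import numpy as np

def apply_affine(points_xy, matrix):
    """Apply a 2 x 3 affine matrix to an (N, 2) array of (x, y) points."""
    points_xy = np.asarray(points_xy, dtype=float)
    matrix = np.asarray(matrix, dtype=float)
    # Append a column of ones so that the translation is part of the product.
    homogeneous = np.hstack([points_xy, np.ones((len(points_xy), 1))])
    return homogeneous @ matrix.T

def fit_affine(src_xy, dst_xy):
    """Least-squares 2 x 3 affine matrix mapping src points onto dst points."""
    src_xy = np.asarray(src_xy, dtype=float)
    dst_xy = np.asarray(dst_xy, dtype=float)
    homogeneous = np.hstack([src_xy, np.ones((len(src_xy), 1))])
    solution, *_ = np.linalg.lstsq(homogeneous, dst_xy, rcond=None)
    return solution.T
```

At least three non-collinear correspondences are needed to determine the six affine parameters; an affine map can scale, rotate, shear and translate the garment but cannot model the non-rigid deformations discussed for later try-on methods.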
Jiang et al. [183] proposed two different frameworks: the first, called FashionG, is a fashion style generator for single-style generation; the second, called SC-FashionG, is a spatially constrained FashionG for mix-and-match style generation. Both networks are end-to-end feed-forward networks that comprise a fashion style generator and a patch-global style discriminator. The inputs are composed of cropped images of online shopping apparel products and full images of street fashion photographs.
The work proposed by Liu et al. [184] investigates clothing match rules based on semantic attributes according to the GAN model. Specifically, an Attribute-GAN was proposed to automatically generate clothing match pairs. The core of Attribute-GAN consists of training a generator, supervised by an adversarially trained collocation discriminator and an attribute discriminator.

1) VIRTUAL TRY ON
Virtual try-on has been a very active task, especially in the last few years, and many works have addressed it.
Han et al. [41] presented VITON, a network that uses a coarse-to-fine strategy to seamlessly transfer a desired garment onto the corresponding region of a person in an image using only 2D information. In this process, a rough fitting result is first generated and the mask for the garment is predicted. To make the result more accurate, a refinement network then uses the mask and the coarse result. However, this kind of framework does not work well under large deformations. To solve this problem, Wang et al. [179] introduced a novel architecture called CP-VTON (Characteristic-Preserving VTON). Using a geometric matching module, it handles spatial deformation better by aligning the input clothing with the body shape.
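One common way to combine a coarse result with a garment through a mask is per-pixel blending. The sketch below assumes that the mask acts as a blending weight between the warped garment and the coarse image; the function name `compose` and the tiny 2x2 single-channel "images" are hypothetical and only illustrate the arithmetic, not the exact refinement stage of any cited method.

```python
from typing import List

Image = List[List[float]]  # single-channel image as nested lists in [0, 1]


def compose(coarse: Image, warped_cloth: Image, mask: Image) -> Image:
    """Blend per pixel: mask * cloth + (1 - mask) * coarse."""
    return [
        [m * c + (1.0 - m) * p for p, c, m in zip(row_p, row_c, row_m)]
        for row_p, row_c, row_m in zip(coarse, warped_cloth, mask)
    ]


# Illustrative values: where the mask is 1 the garment pixel is kept,
# where it is 0 the coarse result is kept.
coarse = [[0.2, 0.2], [0.2, 0.2]]
warped_cloth = [[0.8, 0.8], [0.8, 0.8]]
mask = [[1.0, 0.0], [0.5, 0.0]]
refined = compose(coarse, warped_cloth, mask)
```

Intermediate mask values mix the two sources, which softens the seam between the garment and the rest of the image.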
FashionGAN [169] and M2E-TON [185] build on these works. FashionGAN generates the image from a text description: given a fashion image and a sentence describing an outfit different from the one worn by the person, the network dresses the person as described by the sentence. A first GAN creates a segmentation map according to the description, and a second GAN, driven by the segmentation map, generates the output image. M2E-TON, on the other hand, generates the image from a model's image: it dresses one person with the clothing of another, whose image is passed as input. The two people can also have different poses.
Fit-Me [186] is the first work that performs virtual try-on with arbitrary poses. The authors designed a coarse-to-fine architecture for both the pose transformation and the virtual try-on.
Hsieh et al. [50] created FashionOn, which applies semantic segmentation and refines the face and clothing regions, aiming to achieve a more realistic output image and to solve the human limb occlusion problem of CP-VTON.
ClothFlow [187], proposed by Han et al., focuses on clothing regions to obtain more natural results.
Unlike the works described so far, which operate on images, FW-GAN, presented by Dong et al. [51], learns to generate a video of a moving person from a person image, the desired clothes image, and a sequence of target poses. FW-GAN aims to synthesize a coherent and natural video by manipulating poses and clothes.
Recently, Fincato et al. proposed a new solution for this task, called VITON-GT (GT stands for Geometric Transformation), in which a multi-stage geometric transformation is performed to reduce distortions and artifacts in the generated images. The model has two parts:
• a two-stage geometric transformation module, where an affine transformation and a warping transformation are performed to move the desired clothes onto the target person and to generate a warped version of the desired clothes;
• a transformation-guided try-on module, where the results are generated.
Furthermore, to increase the realism of the generated images, they integrated adversarial training in the second stage of their architecture and designed a fine-tuning scheme to improve the final image quality even more.

2) POSE GUIDED GENERATIVE MODEL
Another very interesting task, which has developed especially over the last few years, is changing a person's pose.
The first work dedicated to this problem is that of Ma et al. [177]. They adopted a divide-and-conquer method that separates the task into two steps. The first step learns the body structure of the person and is performed by a variant of U-Net [188] that combines the desired pose with the person image. The second step learns the particular details of the physical appearance to improve the results, using a modified Deep Convolutional GAN (DCGAN) model. The model learns to fill in more appearance details via adversarial training and creates sharper images. Unlike previous methods that use GANs to generate an image directly, here the GAN is used to create a difference map between the image of the target person and the output of the first step.
Balakrishnan et al. [189] constructed a network that takes as input a source image with its 2D pose and a desired 2D target pose, and creates an output image. This network first segments the source image into a background layer and multiple foreground layers corresponding to different body parts, allowing it to spatially move the body parts to target locations. The moved body parts are then modified and fused to synthesize a new foreground image, while the background is separately filled with appropriate texture to address gaps caused by disocclusions. Finally, the network composites the foreground and background to produce an output image.
Dong et al. [190] proposed a Soft-Gated Warping-GAN to address the problems that arise when geometric transformations are used to change the pose, in particular spatial misalignment. The method consists of two steps:
• A pose-guided parser creates a part segmentation map: given the target pose, it describes the spatial layout and provides a high-level structural constraint that helps image generation.
• A Warping-GAN provides an accurate representation within each segmentation part: this type of network learns the geometric mappings between the original image and the target pose, guided by the predicted segmentation map.
To solve the task of shape-guided image generation, Esser et al. [191] presented a conditional U-Net [188], conditioned on the output of a variational autoencoder for appearance.
Lassner et al. [192] proposed a generative model of people called ClothNet. This data-driven network takes 3D information from an image that contains a body model. They presented two versions of this network: a simpler model, called ClothNet-full, that randomly generates images of people from a learned latent space, and a conditional model, called ClothNet-body, that produces random people with a pose similar to the target pose but with different garments.
Ma et al. [193] created a framework that learns manipulable embedding features for three factors: foreground, background and pose. It is composed of two main steps: in the first, a multi-branched reconstruction network encodes the factors to generate the image; in the second, an adversarial network performs the sampling task using mapping functions.
In contrast, the GAN-based method proposed by Siarohin et al. [194] is trained end-to-end by explicitly considering pose-related spatial deformations. In particular, based on the structural deformations described by the conditioning variables, they proposed deformable skip connections, which move local information. These layers are employed in a U-Net [188] based generator.
Pumarola et al. [195] proposed a fully unsupervised GAN framework that, given a photo of a person, automatically generates images of that person with novel camera views and distinct body poses. Their architecture combines pose-conditional adversarial networks [196], Cycle-GANs [197] and the loss functions used in image style transfer in order to produce novel images of high perceptual quality [198].
Zhu et al. [199] created a model that transfers the pose of the person in the target image during the encoding phase. This is done by an attention-based progressive system [200]. Both of the above methods extract the image features during the encoding step and merge the extracted representations before the decoding phase.
Another work is that of Yildirim et al. [201], in which StyleGAN [170] is applied to the task of image generation: a constant vector is taken as the content input, and a combination of pose and garment information is taken as the style input.
Huang et al. [202] introduced an image generation method using an end-to-end system that consists of an appearance encoder, that learns the appearance representation of the person within the image, and an Appearance-aware Pose Stylizer (APS), where the image is progressively created from a small to a large scale, to improve the final image quality by adding as much detail as possible.

3) PERFORMANCE OF THE NETWORKS
The performance of the networks discussed so far is reported below. The datasets used for Clothes Synthesis are Fashion-GEN, DeepFashion, VITON, and Zap-Seq and DeepFashion-Seq, the latter two collected from UT-Zap50K and DeepFashion through crowdsourcing via Amazon Mechanical Turk (AMT) [203].
• Fashion-GEN dataset: Table 15 reports the performance of state-of-the-art methods for Clothes Synthesis on the Fashion-GEN dataset. The evaluation is computed in terms of Inception Score (IS) and R-precision.
• DeepFashion dataset: Table 16 reports the performance of state-of-the-art methods for Clothes Synthesis on the DeepFashion dataset. The evaluation is computed in terms of Inception Score (IS), R-precision and classification accuracy.
• VITON dataset: Table 17 reports the performance of state-of-the-art methods for Clothes Synthesis on the VITON dataset. The evaluation is computed in terms of Structural Similarity Index (SSIM) and Inception Score (IS).
• Zap-Seq and DeepFashion-Seq datasets: Table 18 reports the performance of state-of-the-art methods for Clothes Synthesis on the Zap-Seq (a) and DeepFashion-Seq (b) datasets. The evaluation is computed in terms of Structural Similarity Index (SSIM) and Inception Score (IS).
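For readers unfamiliar with the Inception Score used in the tables above, the following sketch computes it from per-image class probabilities as the exponential of the average KL divergence between each conditional distribution p(y|x) and the marginal p(y). In practice the probabilities come from a pre-trained Inception classifier applied to the generated images; here they are passed in directly, and the function name `inception_score` and the toy inputs are assumptions made for illustration.

```python
import math
from typing import List


def inception_score(probs: List[List[float]]) -> float:
    """IS = exp( mean_x KL( p(y|x) || p(y) ) ) for rows of class probabilities."""
    n = len(probs)
    k = len(probs[0])
    # Marginal class distribution p(y), averaged over all images.
    marginal = [sum(p[j] for p in probs) / n for j in range(k)]
    kl_sum = 0.0
    for p in probs:
        # Terms with p_j == 0 contribute nothing to the KL divergence.
        kl_sum += sum(
            pj * (math.log(pj) - math.log(marginal[j]))
            for j, pj in enumerate(p)
            if pj > 0.0
        )
    return math.exp(kl_sum / n)


# Toy inputs: confident and diverse predictions score higher than
# uniform, uninformative ones.
diverse = inception_score([[1.0, 0.0], [0.0, 1.0]])
uniform = inception_score([[0.5, 0.5], [0.5, 0.5]])
```

The score is bounded below by 1 (all images yield the same prediction distribution) and above by the number of classes (each image is classified with full confidence and the classes are covered evenly).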

V. DEEP LEARNING FOR SOCIAL MEDIA ANALYSIS
Extracting fashion knowledge from general users of social networks such as Instagram is valuable, because the images published on social media generally convey ideas for different occasions or information about personal identity. However, this is a very difficult and challenging task. Many images on social networks present scenes that are difficult to interpret, and it is not easy to extract fashion concepts in these contexts, especially because the datasets used for the tasks described above are based on images with a white background. Furthermore, images on social networks either have no labels or have insufficient ones, and these labels are fundamental for the construction of fashion knowledge. A possible solution is to label the dataset manually; however, this requires a large amount of time and carries a relatively high cost.
The datasets found in the literature mainly contain images from online shopping sites that only carry attributes of the specific item shown, and this data is not enough to identify the type of occasion or the identity of the people in an image.

A. AUTOMATIC FASHION KNOWLEDGE EXTRACTION
In recent years, Automatic Fashion Knowledge Extraction has been the focus of many studies in the fields of computer vision, Machine Learning and Deep Learning. Some of these are listed below.
• YAGO (Yet Another Great Ontology) is a knowledge base developed in Saarbrücken, Germany, by researchers from the Max Planck Institute for Computer Science in 2006 [204]; it has undergone continuous improvements and extensions over the years, which continue to this day. YAGO combines broad coverage with high accuracy [205] and is therefore significant in the context of the Semantic Web. It represents all facts in the form of unary or binary relations: classes of entities and pairs of entities linked by specific relations. The data model can be seen as a graph in which entities and classes are the nodes and relations are directed edges. The knowledge is organized in the subject-property-object RDF format [206]: two adjacent nodes and the edge that connects them make up a triple [204]. The core of YAGO is based on Wikipedia, one of the most complete digital encyclopedias available [205];
• Freebase [207] is a system that aims to make all the world's knowledge publicly available. In terms of design, it can be compared to Wikipedia and the Semantic Web. It contains about 125 million tuples, about 4 000 categories and about 7 000 characteristics;
• Wikidata [208] is also a public knowledge base that can be accessed and edited by any person or computer program. It provides centralized access to structured data for Wikimedia projects, including Wikipedia, Wikivoyage, Wikisource and others;
• DBpedia [209] is a knowledge extraction system based on Wikipedia that relies on Semantic Web and Linked Data technologies. Data extraction is done in 11 different languages and yields 400 million facts describing 3.7 million events.
The main problem of all the knowledge bases described above is that they were curated from textual resources only, ignoring the rich information contained in visual data.
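The subject-property-object organisation used by YAGO and similar knowledge bases can be pictured as a labelled directed graph. The sketch below is a toy in-memory triple store; the class `TripleStore`, its methods, and the example facts are illustrative assumptions and do not reproduce the actual schema or API of any of the systems listed above.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


class TripleStore:
    """Stores (subject, property, object) triples as outgoing labelled edges."""

    def __init__(self) -> None:
        self._edges: Dict[str, List[Tuple[str, str]]] = defaultdict(list)

    def add(self, subject: str, prop: str, obj: str) -> None:
        self._edges[subject].append((prop, obj))

    def objects(self, subject: str, prop: str) -> List[str]:
        """Return all objects linked to `subject` through `prop`."""
        return [o for p, o in self._edges.get(subject, []) if p == prop]


# Illustrative facts only (not copied from YAGO): a unary, class-like fact
# and a binary relation between two entities.
kb = TripleStore()
kb.add("Saarbrücken", "type", "city")
kb.add("Saarbrücken", "locatedIn", "Germany")
located = kb.objects("Saarbrücken", "locatedIn")
```

Each stored triple corresponds to two nodes and the directed edge between them, so graph queries reduce to following edges with a given label.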
Subsequently, research studies have focused on extracting knowledge from visual rather than textual data, such as:
• NEIL [210], a program that runs 24 hours a day and was created to extract knowledge from visual data on the Internet, such as images. It is based on semi-supervised algorithms that are able to find relationships between the images and the text written under each of them.
• Visual Genome [211], a dataset for modeling the relationships between objects within an image. Its authors gathered dense annotations of objects, attributes, and relationships within each image to learn these models. The dataset includes over 100K images, where each image has a mean of 21 objects, 18 attributes, and 18 pairwise relationships between objects. The objects, attributes, relationships, and noun phrases in region descriptions and question-answer pairs are canonicalized to WordNet synsets. Together, these annotations represent the densest and largest dataset of image descriptions, objects, attributes, relationships, and question answers.
• Video Visual Relation Detection (VidVRD) [212], which performs visual relation detection in videos instead of still images (ImgVRD). The method consists of tracklet proposal, short-term relation prediction and greedy relational association. Moreover, the authors contributed a dataset for VidVRD evaluation, which contains 1 000 videos with manually labeled visual relations.
Although most research has aimed to derive knowledge from both visual data and the text associated with it, few works perform this type of extraction in the fashion domain.
Following the work done in [52], the problem of extracting fashion knowledge directly from social media can be formulated as follows: a user's fashion knowledge is defined as a triplet K = {P, C, O}, where
• P is Person, with a set of attributes (age, gender, height, weight, etc.);
• C is Clothing, with a set of clothing categories and attributes (dress, long, short, material, etc.);
• O is Occasion (wedding, dating, etc.), accompanied by its own metadata (weather, time, location, etc.).
A set of posts X on any social network is then represented as a triplet X = {V, T, M}, where V is a set of images, T is a set of texts, and M is a set of metadata. The aim is therefore to create an automatic knowledge extraction system that, given the posts X, provides all three pieces of information that make up the fashion knowledge K.
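The formulation above maps naturally onto simple record types. The following sketch expresses K = {P, C, O} and X = {V, T, M} as Python dataclasses; all class and field names are hypothetical and chosen only to mirror the symbols in the text, not the data format of [52].

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Person:  # P
    attributes: Dict[str, str] = field(default_factory=dict)


@dataclass
class Clothing:  # one element of C
    category: str
    attributes: List[str] = field(default_factory=list)


@dataclass
class Occasion:  # O
    name: str
    metadata: Dict[str, str] = field(default_factory=dict)


@dataclass
class FashionKnowledge:  # K = {P, C, O}
    person: Person
    clothing: List[Clothing]
    occasion: Occasion


@dataclass
class Post:  # X = {V, T, M}
    images: List[str]          # V, e.g. image paths
    texts: List[str]           # T, e.g. captions and hashtags
    metadata: Dict[str, str]   # M, e.g. time and location


# Illustrative instance with made-up values.
knowledge = FashionKnowledge(
    person=Person({"gender": "female", "age": "25-34"}),
    clothing=[Clothing("dress", ["long", "silk"])],
    occasion=Occasion("wedding", {"weather": "sunny"}),
)
```

An extraction system in this setting can then be viewed as a function from a `Post` to a `FashionKnowledge` record.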
In this context, Ma et al. [52] built FashionKE, a large dataset of 80 629 images in each of which the three aspects above (person, clothing, occasion) are clearly identifiable. The dataset was constructed in three steps. In the first step, an object detection system with a pre-trained ResNet is used to determine whether an image contains a person and to obtain the feature vector of each clothing region; in the second step, images that do not contain people, or in which the body or face is too deformed, are filtered out; in the last step, all images are manually checked, and those that do not show any occasion are removed. A bidirectional Long Short-Term Memory (Bi-LSTM) network was then used to learn the clothing regions within the image together with the occasion in which the clothes were worn. The final hidden representation of each clothing region is the concatenation of the hidden vectors in both directions.
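The concatenation of forward and backward hidden vectors can be illustrated with a deliberately simplified bidirectional recurrence. The sketch below replaces the LSTM cell with a toy element-wise tanh cell (an assumption made to keep the example dependency-free), so it shows only the bidirectional wiring and the concatenation step, not the model of [52]. The names `toy_cell`, `run` and `bidirectional` are hypothetical.

```python
import math
from typing import Callable, List

Vector = List[float]


def toy_cell(x: Vector, h: Vector) -> Vector:
    """Toy recurrent cell standing in for an LSTM: h' = tanh(x + 0.5 * h)."""
    return [math.tanh(xi + 0.5 * hi) for xi, hi in zip(x, h)]


def run(seq: List[Vector], cell: Callable[[Vector, Vector], Vector]) -> List[Vector]:
    """Run the cell over the sequence and return the hidden state at each step."""
    h = [0.0] * len(seq[0])
    states = []
    for x in seq:
        h = cell(x, h)
        states.append(h)
    return states


def bidirectional(seq: List[Vector]) -> List[Vector]:
    """Concatenate forward and backward hidden states for every position."""
    forward = run(seq, toy_cell)
    backward = run(seq[::-1], toy_cell)[::-1]
    return [f + b for f, b in zip(forward, backward)]  # list concatenation


# Three clothing regions, each described by a made-up 2-dimensional feature.
regions = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
hidden = bidirectional(regions)
```

Each output vector has twice the hidden size, so the representation of a region reflects both the regions before it and those after it in the sequence.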
The most recent work of Ma et al. [59] proposes a novel dataset collected from the popular social network Instagram, called Fashion Instagram Trends (FIT). Specifically, they extracted millions of posts from all over the world. The collected data went through automatic and manual filtering, in a similar way to [52], [213]. The first step is the same as in the previous work [52], i.e. the detection of body [76] and face [214]. The second step is also almost the same: images with a partial or deformed body or face are eliminated. An additional step, unlike the previous work, deletes images containing people who do not belong to the account from which the images were downloaded. In the end, a total of 680 000 images were obtained. Moreover, they proposed a novel LSTM-based knowledge system, called KERN, which also takes into account the time series of fashion trends.
In 2021, Parekh et al. [215] extracted attributes from images on an Indian e-commerce site. They also proposed a framework that uses attention mechanisms to carry out multi-task training and that also tries to balance the datasets.
Table 19 summarizes the automatic fashion knowledge extraction algorithms described in this section.

B. CLOTHES RECOMMENDATION
Clothes Recommendation cannot be generic, since user preferences are naturally subjective: they depend on age, occupation, culture, place of living, and so on. From this perspective, personalisation is fundamental, since it guarantees that the Clothes Recommendation agrees with the personal taste of users and reflects their likes and dislikes from several perspectives. In the last few years, personalised Clothes Recommendation has received great attention [216]-[219]. However, many of these methods are not able to deal with the cold-start problem [220] for new users, and only a few papers focus on this issue.
In the work developed by Bracher et al. [221], the novelty lies in a latent space representation for fashion items, known as FashionDNA.
In [222], the authors combined categories and styles with the clothes using visual representations. These two approaches aimed to overcome the cold-start problem when new items are inserted for the fashion recommendation task, but they completely neglected the characteristics of the users. The works carried out by Piazza et al. [222] and Sun et al. [223] aim to overcome the cold-start problem by using the information provided by users.
The approach of Verma et al. [224] to this problem is instead practical: to understand the preferences of a new user, they used a limited set of preference images, exploiting forecasting in the fashion world.
From each outfit it is possible to extract different types of high-level concepts, both in terms of style and product design [225]. Some works [226] also show that low-level concepts can be transformed into high-level ones. For this purpose, several computer-vision-based approaches have been investigated for representing clothing items from images.
Algorithms based on deep neural networks have reached new heights in terms of performance, also thanks to the availability of rich fashion datasets such as DeepFashion [20] and ModaNet [38].
Liu et al. [20] used a branched CNN architecture, FashionNet. By predicting attributes and landmarks simultaneously, this network learns the characteristics of clothing.
Ma et al. [52] used a learning system that integrates Bi-LSTMs with a CNN backbone, contextualizing it in the fashion domain.
StyleNet was introduced by Yan et al. [227]. This network creates clothing representations using multi-task representation learning and incorporates different fashion concepts.
In recent years, multi-task learning has provided significant results in various applications. For this reason, Verma et al. [224] decided to exploit multi-task learning for the prediction of fashion concepts, modeling the dependencies between clothing categories and attributes. In their work, they also proposed a dataset consisting of 2 893 high-resolution images sourced from Instagram and Pinterest.
Many other datasets for the outfit recommendation task have been built from social networks. Zheng et al. [228] created a dataset with images from Lookbook.nu, a social media platform where users freely post photos of their own outfits. In total they selected 2 293 profiles with no more than 7 000 followers, and for each of these they took the 100 most recent photos, also downloading the photo captions and the corresponding hashtags.
Verma et al. [229] constructed an occasion-oriented fashion knowledge dataset that consists of images downloaded from Instagram and Pinterest and manually annotated as described in [224].
Lin et al. [230] formulated the fashion outfit recommendation problem as a Multiple-Instance Learning (MIL) problem, developing a new network called OutfitNet. The process is divided into two phases. The first phase learns the compatibility between fashion items through a Fashion Item Relevancy network (FIR), which also generates a relevance embedding of fashion items; the second phase learns user preferences through an Outfit Preference Network based on visual information.
Jo et al. [231] proposed a system that recommends fashion designs that match target scenarios or natural landscapes. When a user inputs a query that describes the target scenery, a set of candidate images related to the query is collected in a keyword-based matching manner. Then, the collected image set is used to automatically generate new fashion images using a cross-domain generative adversarial network.
The work of Jo et al. [232] focuses on developing intelligent methods for sketch-product retrieval and personalized recommendation, including a sketch-product retrieval model that overcomes the limitations of a text-based search approach. The sketch-product retrieval model works as follows: a user sketches the desired fashion product, and the sketch is extrapolated to the level of a product image using a GAN; the attributes of the extrapolated product image are derived as vector values, and the vector space is searched for comparable images. The vector-based user preference recommendation model works as follows: a profile obtained by professional filtering based on a DNN is pre-trained and set as the base weighting value of the recommendation model; customized fashion trends are then recommended as the individual's weighting values are learned over time from the preferred fashion profiling.
Tangseng and Okatani [233] proposed a system that is able not only to assess and quantify the goodness or badness of an outfit, but also to provide a rationale for the prediction. The system receives as input the images of the items that make up an outfit and calculates a score that quantifies how good the outfit is. For this purpose, each image is represented as a combination of features that are easily interpreted by the user, so that the score can be better understood by identifying the most influential features. Given an outfit as a set of items, the system extracts the edge image and the main colors of each item. The edge image is forward propagated through a pre-trained CNN; its output and the main colors are then forward propagated through a series of concatenation and fully connected layers with ReLU to obtain the score. The system also calculates the gradient of the score with respect to the representation of each item by backpropagation. The gradients are multiplied by the corresponding features to obtain the Item Feature Influence Value (IFIV).
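The gradient-times-feature idea behind the IFIV can be sketched on a tiny scoring network. The example below assumes a bias-free two-layer network with ReLU and hand-picked weights, and computes the gradient analytically instead of through a deep learning framework; the function `score_and_ifiv` and all numeric values are hypothetical and do not reproduce the architecture of [233].

```python
from typing import List, Tuple


def score_and_ifiv(
    x: List[float], w1: List[List[float]], w2: List[float]
) -> Tuple[float, List[float]]:
    """Score = w2 . relu(w1 @ x); influence_i = d(score)/d(x_i) * x_i."""
    z = [sum(w * xi for w, xi in zip(row, x)) for row in w1]
    h = [max(0.0, zj) for zj in z]
    score = sum(wj * hj for wj, hj in zip(w2, h))
    # Backpropagation by hand: only hidden units with positive
    # pre-activation pass gradient through the ReLU.
    grad = [
        sum(w2[j] * w1[j][i] for j in range(len(w1)) if z[j] > 0.0)
        for i in range(len(x))
    ]
    influence = [g * xi for g, xi in zip(grad, x)]
    return score, influence


# Made-up item features and weights.
features = [1.0, 2.0, 0.5]
w1 = [[1.0, -1.0, 0.0], [0.5, 0.5, 2.0]]
w2 = [1.0, 2.0]
score, influence = score_and_ifiv(features, w1, w2)
```

In this bias-free toy network the influence values add up to the score itself, which makes it easy to read each value as the share of the score attributable to one feature; a network with biases would not satisfy this identity exactly.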
Li et al. [234] presented a Clothes Recommendation approach that models the compatibility among diverse fashion products. The method is based on a category-aware metric learning framework that embeds fashion articles so that cross-category compatibility among items is learned while the intra-category variety is maintained. Their system mainly consists of three parts: a pre-trained CNN for visual feature extraction; a category complementary relation embedding space for modeling category-aware compatibility; and multiple relation-specific projection spaces for preserving the intra-class diversity.
Li et al. [235] used a hierarchical graph network, the Hierarchical Fashion Graph Network (HFGN), to describe the relationships between users, proposed outfits and the items within the images. They assigned an ID embedding to each user and outfit, and used visual features to represent each item. To update the outfit representations and refine the user representations, they used an information sharing rule that also traces the historical outfits. Moreover, a joint learning scheme was proposed to perform compatibility matching and outfit recommendation simultaneously.
Liu et al. [236] proposed a neural graph filtering framework that enables flexible and diverse fashion collocation. Its innovation is that it accepts inputs and outputs of different lengths and can recommend fashion collocations in various styles. Lastly, it also handles unbalanced datasets very well.
The work of Yu et al. [237] is an extension of their earlier work [238], in which they demonstrate that aesthetics are very important in modeling and predicting users' preferences, especially in fashion-related domains, and that it is therefore important to model aesthetic information. Using a deep neural network, they extracted aesthetic features from the images and subsequently incorporated them into the recommendation system. The aesthetic features were then given as input to the basic tensor model in order to capture temporal preferences. In addition, they investigated aesthetic features in negative sampling to obtain further benefits in recommendation tasks.
The work conducted by Zhan et al. [239] aims to predict user preferences through an Attentive Attribute-Aware Fashion Knowledge Graph, called A3-FKG, which is used to establish relationships between multiple outfits considering outfit-level and product-level attributes. Moreover, a mechanism composed of two attention layers was developed to understand the preferences of each user: the first, user-specific relation-aware layer captures the user's fine-grained preferences, with different focus on relations, for learning the outfit representation; the second is a target-aware layer.
Table 20 summarizes the clothes recommendation algorithms used for different applications.

VI. DISCUSSIONS AND OPEN QUESTIONS
We close this paper by returning briefly to the questions raised at the beginning, which remain largely open.
For what tasks is fashion data used and how has the use of this data developed over time? Initially, research in the fashion domain aimed only at classifying images of people wearing certain items of clothing, which could be annotated or not. Research then moved on to the detection of landmark points to facilitate the detection of clothes. With the advent of GANs, there has been a particular focus on Clothes Synthesis, whose task is to dress a model in an arbitrary pose. At the same time, with the continuous growth of social networks, fashion research has also focused on extracting knowledge from images coming from social networks such as Instagram.
Comparing machine and deep learning methods, does the fashion data influence the choice of using one methodology rather than another? Among the methods most used in the fashion domain, deep learning approaches prevail over classical machine learning ones. As can be seen in the tables of the various sections, convolutional neural networks are the basis of most of the methods developed. The datasets that exist in the fashion domain greatly influence the choice between deep learning and machine learning methods: they are very challenging and difficult to handle. For this reason, deep learning methods are more common.
What are the future applications that need to be developed and deepened that use fashion data? One direction is to improve the way customers discover a product: for example, developing methods that allow users to buy online simply by taking or uploading a photo, so that the site returns that article, or at least a similar one.
Another direction is research on Clothes Recommendation. Fashion brands need to better predict customer preferences by collecting and investigating shopper behavior, customer profiles and customer feedback. This information, in conjunction with deep learning techniques, allows fashion retailers to provide a customized choice of clothing for customers.
In this context, large-scale data annotation is very important. Given the huge amount of data concerning fashion and beauty, generating precise annotations while decreasing the cost and maintaining the quality is a challenging problem. Hence, a major effort towards cost-effective annotation approaches for fashion and beauty data is required.
How has social media changed the marketing strategies of fashion brands? Social networks have had and continue to have a very strong influence on the fashion world, completely rewriting the rules of marketing. It is on social channels that trends in the fashion sector and consumer behavior are studied, and it is on social media that new influencers and Brand Ambassadors emerge every day. While the TV channel and the physical store used to be the best ways to promote a brand, today they are obsolete, and social networks, e-commerce and influencers prevail. The real protagonist in this area is the consumer, who participates at 360° from the product creation phase to the launch and promotion phase. That is why it is necessary to rethink marketing strategies and adapt them to social networks, creating a close and deep relationship of trust with the user. To build customer loyalty and enhance the connection with the Brand, companies can use different digital strategies that build a new customer experience:
• Mobile commerce: through social networks and apps, the company can create direct contact with customers wherever they are;
• Brand Ambassador: represents the image of the Brand, knows its mission, vision, values and goals, and uses them to involve followers and guide them in the purchasing process;
• Big data: Big Data analysis is used to improve customer care, help predict trends and develop new market strategies, all with much more precise results and in minimum time.

VII. LIMITATIONS AND LESSONS LEARNT
Exploiting Deep Learning for the interpretation of fashion social media data is challenging, since it involves the management of heterogeneous data, different scales of representation, and different purposes of data processing. The challenges related to its application can be categorised as follows:
• Lack of Available Dataset: Regardless of the topic and/or the kind of data in the training phase (given the assumption that DL models can be arranged to fit a specific task), there is a lack of available datasets in the literature to be used as benchmarks. It is well known that DL methods are data-driven techniques that generally perform better as the number of training samples increases. Attempts to solve this problem have involved the generation of synthetic datasets. Recently, generative models have proven to be effective for this task. Generative adversarial networks (GANs), introduced in 2014 by Goodfellow et al. [162], are an appealing DL approach for this purpose.
• Domain Dependent Models: Since there is no all-in-one solution for every task within each fashion sub-domain, each AI-based model should be chosen according to the task one is attempting to solve. In other words, as AI improves, the need has emerged to understand how to make such models effective, choosing them according to the kind of data for which they have been designed. Integrating the knowledge of domain experts into AI models increases the reliability and the robustness of algorithms, making decisions more accurate. Moreover, the knowledge acquired for one task can be used to solve related ones thanks to transfer learning strategies.
• Hardware Limitations: Despite the growing computational capabilities of better-performing CPUs and the advances in distributed and parallel high performance computing (HPC), the computational costs of the abovementioned tasks remain high. We are not yet at a stage where the time gained and the resources spent are in balance, which at times makes DL-based methods less convenient than time-consuming but more affordable manual solutions.
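As an illustration of the transfer learning strategy mentioned under Domain Dependent Models, the following is a minimal sketch and not a method taken from the reviewed works. It assumes a hypothetical "pretrained" feature extractor, represented here by a small fixed mapping (`FROZEN_WEIGHTS`, `extract_features`), and trains only a new logistic-regression head on a synthetic target task. All names and data are assumptions made for the example; in practice the frozen part would be a deep network trained on a large source dataset.

```python
import math
import random

random.seed(0)

# Hypothetical "pretrained" backbone (assumption): a fixed mapping standing in
# for a deep network trained on a source task. It is never updated below.
FROZEN_WEIGHTS = [[0.8, -0.3], [0.1, 0.9], [-0.5, 0.4]]


def extract_features(x):
    """Frozen feature extractor: fixed linear map followed by tanh."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in FROZEN_WEIGHTS]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_head(samples, labels, epochs=200, lr=0.5):
    """Train only a new logistic-regression head on top of the frozen features."""
    weights = [0.0] * len(FROZEN_WEIGHTS)
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            f = extract_features(x)
            p = sigmoid(sum(w * fi for w, fi in zip(weights, f)) + bias)
            error = p - y  # gradient of the log-loss with respect to the logit
            weights = [w - lr * error * fi for w, fi in zip(weights, f)]
            bias -= lr * error
    return weights, bias


def predict(head, x):
    """Probability of class 1 for input x, using frozen features and the trained head."""
    weights, bias = head
    f = extract_features(x)
    return sigmoid(sum(w * fi for w, fi in zip(weights, f)) + bias)


# Synthetic target task (assumption): label is 1 when the first input exceeds the second.
samples = [[random.uniform(-1, 1), random.uniform(-1, 1)] for _ in range(40)]
labels = [1 if a > b else 0 for a, b in samples]

head = train_head(samples, labels)
accuracy = sum((predict(head, x) > 0.5) == bool(y) for x, y in zip(samples, labels)) / len(samples)
print(f"training accuracy of the new head: {accuracy:.2f}")
```

Only the small head is optimised, so the cost of adapting to the new task is far lower than training the whole model, which is also relevant to the hardware limitations discussed above.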

VIII. CONCLUDING REMARKS
Valued at over 3 trillion dollars, the global fashion industry contributes a healthy 2% of the global Gross Domestic Product (GDP). For this reason, fashion companies are increasingly investing in artificial intelligence in order to fully satisfy the customer. In particular, social media have long since changed the way customers perceive the world of fashion: in this context, social networks, in particular Facebook and Instagram, are fundamental communication tools. Above all, Instagram has become fundamentally important for companies, which pay influencers to sponsor products and thereby shape consumer preferences. For this reason, this review summarizes the datasets that have been collected and the deep learning methods that have been used in the fashion sector, and in particular on social networks. Methods and techniques for each kind of fashion task have been analysed, the main paths have been summarised, and their contributions have been highlighted. This review offers rich information and improves the understanding of the research issues related to the use of AI with social media fashion data. Furthermore, it is informative on whether and how DL techniques and methods could help the development of applications in various fields. This work thus paves the way for further research in the domain. Future research directions include improving the algorithms so that they exploit other, more comprehensive features, thereby achieving better performance.

MARINA PAOLANTI is currently an Assistant Professor with the Department of Political Sciences, Communication and International Relations, University of Macerata, and an Adjunct Professor with the Department of Information Engineering (DII), Università Politecnica delle Marche. During her Ph.D., she worked with GfK Verein, Nuremberg, Germany, on visual and textual sentiment analysis of brand-related social media pictures using deep convolutional neural networks. Her research focuses on artificial intelligence and computer vision, with a particular focus on specialized machine learning algorithms and deep learning architectures.

EMANUELE FRONTONI is currently a Full Professor of computer science with the University of Macerata and the Co-Director of the VRAI Laboratory, Department of Information Engineering (DII), Università Politecnica delle Marche. He coordinated and participated in several industrial research and development projects in collaboration with ICT and mechatronics companies in the field of ambient assisted living. His research interests include computer vision and artificial intelligence with applications in robotics, video analysis, and human behavior analysis, and the automatic classification of images. He is also involved in several e-health projects in the field of data interoperability, cloud-based technologies, and big data analysis. He is a member of the European Association for Artificial Intelligence, the European AI Alliance, and the International Association for Pattern Recognition.

PRIMO ZINGARETTI (Senior Member, IEEE) is currently a Full Professor of computer science with Università Politecnica delle Marche, Italy. He has authored over 150 scientific research articles in English. His main research interests include artificial intelligence, robotics, intelligent mechatronic systems, computer vision, pattern recognition, image understanding and retrieval, information systems, and e-government. Robotics vision and geographic information systems have been the main application areas, with great attention directed to the technological transfer of research results. He is a member of ASME and GIRPR-IAPR, and a Co-Founder of AI*IA.