Introduction
Choosing the appropriate clothes to wear every morning is an arduous task for many of us. Various factors such as avoiding a repetition of the same clothes on consecutive days and wearing one’s favorite clothes frequently are taken into account while choosing one’s outfit. In addition, suitable combinations need to be considered when buying new clothes. Fashion applications aid users in making such decisions.
Fashion application development is an emerging research field, and fashion recommendation systems have become a topic of significant interest [1]–[5]. However, a recommendation system has to consider many variables such as the types of clothes the user owns, how frequently these clothes are worn, and personal preferences to satisfy the above requirements.
If everyday outfits are recorded manually by a user, the frequency of choosing to wear the same clothes can be ascertained. However, it is a tough task to maintain this record every morning, especially with hectic schedules. To the best of our knowledge, there is no system to record this data automatically.
In this study, we propose a system that automatically “aggregates” the everyday outfits of users and displays how frequently a user wears each outfit (Fig. 1) without manual recording. Here, the term aggregate refers to collecting pieces of clothing and counting each of them. The system shows users the aggregation results whenever it captures a piece of clothing, and the users can correct the results manually in case of misrecognition. We call this correction user feedback. Recently, smart speakers, such as Google Home [6], have seen increased usage, and some systems, such as Echo Look [7], have built-in cameras. Cameras installed in smart speakers are expected to become more widespread in the near future. In the proposed system, we assume that users are observed every morning by such cameras mounted in their rooms, and these observations are used as input images for the proposed system. Surveillance cameras installed in entrances or rooms can also be used.
By applying the incremental clustering algorithm [8] to clothing images detected by a camera, the system can aggregate clothing and count items every day. To reduce the frequency of user feedback, the clustering should be highly accurate. For accurate clustering, appropriate image features with discriminative power must be extracted from clothing images. A discriminative feature space can be trained using a large number of diverse clothing images. We call this feature space the generic clothing feature space. However, clothing preferences vary from person to person; the clothing variation of a single user is much smaller than the clothing variation in the world. Therefore, there is a gap between the clothing feature space of each specific user and the generic clothing feature space, as shown in Fig. 2. This gap can make the generic clothing feature space inappropriate for each individual user.
Example of the gap between the generic clothing feature space and user-specific clothing feature spaces.
To fill the gap, we introduce an interactive user adaptation method based on user feedback. By fine-tuning a feature extractor using user feedback, the generic clothing feature space will be transferred to the user-specific clothing feature space. Better clustering can be expected in the user-specific clothing feature space. In the proposed system, we assume that this procedure is run on a cloud server whenever the user feedback is provided; thus, the clothing feature space is updated on demand.
The contributions of this study can be summarized as follows:
Everyday outfit aggregation system: we propose a system that automatically aggregates everyday clothing images based on incremental clustering with user feedback.
Interactive User Adaptation: we propose an interactive user adaptation method that adapts the generic clothing feature space to a user-specific clothing feature space based on user feedback.
Evaluation using a real clothing dataset: we collected a dataset of about 7,000 images by taking photos of twelve users every morning for a month.
The rest of this paper is organized as follows: In Section II, recent work on fashion analysis and recommendation is summarized. In Section III, the details of the proposed clothing aggregation system are discussed. In Section IV, the proposed interactive user adaptation method is explained in detail. Experimental results are presented in Section V. Finally, we conclude the paper in Section VI.
Related Work
Fashion analysis and recommendation are extensively studied topics in the fields of information retrieval, computer vision, and image processing [9]–[13]. Recently, the usage of clothing recommendation systems [2], [14]–[16] that match users’ preferences has become a trend on e-commerce websites. Yu et al. [15] have proposed a recommendation system that focuses on aesthetics, which is highly relevant to user preference. Abe et al. [17] collected large datasets of images from YFCC100M [18] and analyzed clothing trends with regard to geolocation, especially for metropolitan areas. The method proposed by He and McAuley [19] also focuses on fashion trends.
Many studies have focused on selecting clothing at home, and several user interfaces to support this task have been proposed [20], [21]. When a user is selecting a piece of clothing, these systems show several recommended pieces of clothing to the user. McAuley et al. [22] proposed a recommendation system using a query image. Tangseng et al. [23] proposed a recommendation system for clothing in a closet. These existing clothing recommendation systems require registering all the clothing that the user owns in a closet database and also require recording everyday outfits manually, which is a time-consuming process.
Recently, fashion generation has also become a hot topic in this field [24]–[29]. Various methods that generate fashion images using GANs have been proposed. However, training a GAN model to generate user-specific fashion images also requires collecting a large-scale record of the user’s everyday outfits beforehand.
The Proposed Clothing Aggregation System
A. Overview
The goal of the proposed system is to automatically aggregate clothing items usually worn by a user and to determine how frequently the user wears them. By incrementally clustering clothing photos taken every morning, the system realizes the aggregation and calculates the frequency. The process flow of the proposed method is shown in Fig. 3. We assume that each photo is taken at the user’s room entrance every morning when the user goes out. Therefore, we can assume that the photos are taken in the same environment, and the users in the photos are in similar postures.
Process flow of the proposed system: feature extraction and incremental clustering are applied to cropped images. The user provides corrections as user feedback if the clustering results are incorrect.
B. Human Detection and Pose Estimation
Human detection and pose estimation are applied to extract the regions of clothing from photos. As a result, the locations of the joints of the human body are obtained. An example result of the pose estimation is shown in Fig. 4.
Example result of human pose estimation: the red rectangle indicates the extracted part of the upper body of the person.
Although multiple people could be detected in a photo, we assume that only one person appears in each photo, because the camera is installed at the user’s room entrance.
C. Cropping of the Clothing Region From an Image
The proposed system crops clothing regions based on the locations of body joints obtained by the process described in Section III-B.
For example, given the human pose, an image of clothing for the upper body is cropped based on the minimum bounding box of the upper-body joints. We define the upper body by the locations of the shoulders, elbows, and hips. For a lower-body image, we use the locations of the hips, knees, and ankles. Examples of the cropped images are shown in Fig. 5. The following processes are applied independently to these upper- and lower-body images.
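The joint-based cropping described above can be sketched as follows; the margin value and the example joint coordinates are illustrative assumptions, not values from the paper.

```python
import numpy as np

def crop_region(image, joints, margin=10):
    """Crop the minimum bounding box around the given joint coordinates,
    expanded by a small margin (the margin value is a hypothetical choice).

    image:  H x W x 3 array
    joints: list of (x, y) joint coordinates in pixels
    """
    xs = [x for x, _ in joints]
    ys = [y for _, y in joints]
    h, w = image.shape[:2]
    x0 = max(int(min(xs)) - margin, 0)
    y0 = max(int(min(ys)) - margin, 0)
    x1 = min(int(max(xs)) + margin, w)
    y1 = min(int(max(ys)) + margin, h)
    return image[y0:y1, x0:x1]

# Upper body: shoulders, elbows, hips; lower body: hips, knees, ankles.
image = np.zeros((480, 640, 3), dtype=np.uint8)
upper = crop_region(image, [(200, 100), (400, 100), (180, 200),
                            (420, 200), (250, 260), (350, 260)])
```

The same function would be applied a second time with the lower-body joints to obtain the lower-body image.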
D. Feature Extraction
To cluster clothing images accurately, we need an appropriate feature space. Here, we build a discriminative feature space using diverse kinds of clothing, named the generic clothing feature space. We train a feature extractor that maps an input image to a vector in the generic clothing feature space.
We use a convolutional neural network (CNN) as the image feature extractor. Pre-trained ResNet [30] models are often used as the backbones of feature extractors. The pre-trained ResNet model is trained using the ImageNet dataset [31], which consists of various images of a large number of classes. Therefore, it is not suitable for capturing clothing features accurately as is. We thus append a trainable fully-connected (FC) layer that outputs the clothing feature vector \begin{equation*} \mathbf {f}_{t} = f(I_{t}; \widehat {\Theta }),\tag{1}\end{equation*} where $I_{t}$ is the input clothing image and $\widehat{\Theta}$ denotes the trainable parameters.
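The extractor above, a frozen backbone with a single trainable FC layer on top, can be sketched with a linear map standing in for the FC layer; the backbone output dimension of 2048 (ResNet-50 pooled features) and the 128-dimensional output are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Dimensions are assumptions: 2048 matches ResNet-50 pooled features,
# 128 is a hypothetical clothing-feature dimension.
backbone_dim, feature_dim = 2048, 128

# Trainable FC parameters (the hat-Theta of Eq. (1)); the backbone is frozen.
W = rng.standard_normal((feature_dim, backbone_dim)) * 0.01
b = np.zeros(feature_dim)

def extract_feature(backbone_output):
    """f_t = f(I_t; Theta-hat): only W and b would be updated in training."""
    return W @ backbone_output + b

phi = rng.standard_normal(backbone_dim)   # stands in for the frozen CNN output
f_t = extract_feature(phi)
```

In practice the frozen backbone and the FC layer would be one network in a deep-learning framework; the sketch only isolates which parameters are trainable.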
The purpose of the feature extraction here is clothing clustering. Thus, we train the model as metric learning to tune the similarity metric of the features.
For metric learning, we employ the contrastive loss [32] with a Siamese network, which is widely used in person re-identification [33]–[35]. The clothing feature space is optimized by using this loss. The loss calculation is shown in Fig. 6. When two given pieces of clothing are the same instance, they should have similar features, and when they are different, they should have dissimilar features. The contrastive loss using the cosine similarity is defined as \begin{align*} L(\mathbf {f}_{1}, \mathbf {f}_{2})=&\begin{cases}1 - s(\mathbf {f}_{1}, \mathbf {f}_{2}),& (\text {same})\\ \max (M - (1 - s(\mathbf {f}_{1}, \mathbf {f}_{2})), 0),& (\text {different})\\ \end{cases}\qquad \tag{2}\\ s(\mathbf {f}_{1}, \mathbf {f}_{2})=&\frac {\mathbf {f}_{1}^\top \mathbf {f}_{2}}{\|\mathbf {f}_{1}\|\|\mathbf {f}_{2}\|},\tag{3}\end{align*} where $M$ is a margin parameter and $s(\cdot,\cdot)$ is the cosine similarity defined in (3).
Contrastive loss calculation. Given two image features extracted by the CNN independently, the contrastive loss is calculated. The parameters of the last FC layer indicated by the red rectangle are optimized while those of the ResNet are frozen. The output of the last FC layer is used as the generic clothing feature.
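Equations (2)–(3) can be sketched directly; the margin value M = 0.5 is an illustrative assumption, since the paper's value is not given here.

```python
import numpy as np

def cosine_similarity(f1, f2):
    """Eq. (3): cosine similarity between two feature vectors."""
    return float(f1 @ f2 / (np.linalg.norm(f1) * np.linalg.norm(f2)))

def contrastive_loss(f1, f2, same, margin=0.5):
    """Eq. (2): pull same-instance pairs together, push different pairs
    apart until the cosine distance exceeds the margin M (assumed 0.5)."""
    d = 1.0 - cosine_similarity(f1, f2)   # cosine distance, in [0, 2]
    if same:
        return d
    return max(margin - d, 0.0)

f = np.array([1.0, 0.0])
g = np.array([0.0, 1.0])
```

Note that a "different" pair already farther apart than the margin contributes zero loss, so training focuses on hard negatives.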
E. Incremental Clustering of Clothing Images
The system captures a photo every day, crops a clothing image of the user, and stores the clothing image in the database. The number of clothing items is unknown in advance, and it changes when the user buys new clothes or discards old ones. Therefore, we use the incremental clustering algorithm [8] in the system.
In incremental clustering, the set of cluster centers $\{\mathbf{c}_{i}\}$ is maintained and updated as new samples arrive. Given the feature $\mathbf{f}_{t}$ extracted from the clothing image of day $t$, the most similar cluster is selected as \begin{equation*} i_{t}=\arg \max _{i} s(\mathbf {f}_{t}, \mathbf {c}_{i}).\tag{4}\end{equation*}
When the maximum similarity $s(\mathbf{f}_{t}, \mathbf{c}_{i_{t}})$ is above a threshold, $\mathbf{f}_{t}$ is assigned to cluster $i_{t}$ and the cluster center is updated; otherwise, a new cluster is created with $\mathbf{f}_{t}$ as its center.
Examples of the incremental clustering procedure: the small white circle indicates an incoming data point, and colored circles indicate existing data points. The colored dashed circles indicate existing clusters.
Finally, the system counts the number of samples for each cluster and outputs the results.
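The assignment rule of Eq. (4) can be sketched as follows; the similarity threshold of 0.8 and the running-mean center update are illustrative assumptions, as the exact update rule of [8] is not reproduced here.

```python
import numpy as np

def assign(feature, centers, counts, threshold=0.8):
    """Assign a new feature to its most similar cluster center (Eq. (4)),
    or create a new cluster when the maximum cosine similarity falls
    below `threshold` (the threshold value is a hypothetical choice)."""
    f = feature / np.linalg.norm(feature)
    if centers:
        sims = [f @ (c / np.linalg.norm(c)) for c in centers]
        i = int(np.argmax(sims))
        if sims[i] >= threshold:
            # Update the center as the running mean of its members
            # (an assumed update rule).
            centers[i] = (centers[i] * counts[i] + feature) / (counts[i] + 1)
            counts[i] += 1
            return i
    centers.append(feature.copy())
    counts.append(1)
    return len(centers) - 1
```

Counting samples per cluster then amounts to reading off `counts`, which is exactly the per-item wearing frequency the system reports.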
Proposed Interactive User Adaptation
As discussed in Section I, since there is a gap between the generic clothing feature space and the appropriate feature space for a specific user, we introduce an interactive user adaptation method based on user feedback. By utilizing the feedback, the generic clothing feature space is transferred to the user-specific clothing feature space.
A. Overview
The generic clothing feature space is defined by metric learning, as described in Section III-D. In the generic clothing feature space, diverse clothing features can be discriminated. However, the set of clothing owned by a user is a small subset of this diverse clothing. Owing to personal preference, the clothing items of a user are often similar to one another. Therefore, their features tend to be distributed locally in the generic clothing feature space, making it difficult to distinguish the features of a specific user's clothing there. Fig. 2 shows an example of the relation between the generic clothing feature space and the clothing feature spaces of several users. In Fig. 2, person A prefers red clothing, and person B prefers clothing with faces printed on them. Both of these sets are small subsets of the clothing in the world.
To fill the gap between the generic clothing feature space and a user-specific clothing feature space, we utilize user feedback, as shown on the right side of Fig. 3, to update the metric of the feature space.
B. User Feedback
In the process flow of the clothing aggregation system (Fig. 3), whenever the system receives an input image and performs clustering, users are expected to confirm the clustering results. If a user finds that the label assigned to the input of the day is wrong, he/she is expected to modify the result to the correct label. Since the clustering algorithm is the incremental clustering described in Section III-E, the user only needs to confirm the label of the latest input. If the user modifies the clustering result every day, all of the results will be stored correctly.
For modifying the wrong label, the system shows the image of each cluster center, and then the user selects the correct cluster that the current clothing should belong to. If there is no cluster for the current clothing, the user needs to create a cluster for it.
Since confirming whether the assigned label is correct (confirmation) is an easy task, it should be acceptable to users. In contrast, selecting the correct label (modification) is cumbersome and should remain a rare event. The frequency of label correction depends on the performance of the clustering method. Therefore, user adaptation is performed to improve clustering performance.
C. Feature Extractor Updating and Feature Re-Extraction
We propose a method that transforms the feature space to a user-specific clothing feature space to distinguish the clothing of users clearly (Fig. 8).
To transform the generic clothing feature space into a user-specific clothing feature space, the feature extractor is modified using user feedback. Since the label assignment is expected to be corrected by user feedback, we use the labels assigned by the incremental clustering as the ground truth for updating the feature extractor. The process flow of the proposed user adaptation is shown in Fig. 9.
Since the feature extractor is trained based on the Siamese network explained in Section III, we need a set of pairs of clothing as training data for the user adaptation procedure.
Here, we assume that the system stores all of the cropped clothing images on a cloud server. After updating the feature extractor, all features of the clothing of the user are updated by re-extracting them with the new feature extractor. This update procedure is run on the cloud server whenever user feedback is provided.
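Building the training pairs for the Siamese fine-tuning from the (corrected) cluster labels might look as follows; the balanced negative sampling is an illustrative assumption, since the paper's exact pairing strategy is not shown here.

```python
import itertools
import random

def make_pairs(labels, n_negative=None, seed=0):
    """Build Siamese training pairs from per-image cluster labels:
    all same-label index pairs become positives, and (by an assumed
    balancing heuristic) an equal number of different-label pairs are
    sampled as negatives."""
    rng = random.Random(seed)
    idx = range(len(labels))
    positives = [(i, j) for i, j in itertools.combinations(idx, 2)
                 if labels[i] == labels[j]]
    negatives = [(i, j) for i, j in itertools.combinations(idx, 2)
                 if labels[i] != labels[j]]
    k = n_negative if n_negative is not None else min(len(positives), len(negatives))
    return positives, rng.sample(negatives, k)
```

Each returned pair would be fed to the contrastive loss of Section III-D with `same=True` for positives and `same=False` for negatives.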
Evaluation
A. Dataset
To evaluate the proposed system, we needed a dataset containing images of the everyday outfits of several persons. To the best of our knowledge, there are no such public datasets; hence, we collected a dataset of photos of everyday outfits. The dataset contains photos of twelve subjects. Each subject captured photos of their own outfits every day for about a month while selecting their outfits as usual. Here, we assume that the subjects do not change their outfits within a day. The number of outfit pieces per subject varies from 9 to 26. We assume that a camera is installed at the entrance of each subject's room. Therefore, the orientations of the subjects are similar, and the background and illumination are almost the same within the photos of each user. Whenever the subjects captured their outfits, they took about twenty photos in different postures. The total number of photos is 7,456. Samples of the photos are shown in Fig. 10.
B. Evaluation Settings
Since the clustering results depend on the order of the input sequence, we simulated a subject selecting a piece of clothing every day over a fixed number of iterations by randomly ordering the collected photos.
We performed leave-one-person-out cross-validation. For each subject, we trained the generic clothing feature space using the clothing of the rest of the subjects. The weights of the ResNet were then frozen, and only the weights of the FC layers were updated. Here, ResNet-50 is used as the backbone.
For human body detection and pose estimation, we used OpenPose, proposed by Cao et al. [36]. For feature extraction, we set the dimension of the feature vectors to a fixed value.
As an evaluation metric for clustering, we used the adjusted Rand index (ARI) [38], which is commonly used for evaluating clustering results. The ARI compares two label assignments. Here, we assume that the labels assigned by the clustering algorithm and the ground-truth labels are given for each photo. The ARI reaches its highest value of 1.0 when the two label assignments are identical, and it can be lower than zero when they largely disagree. Let $\mathcal{X}$ and $\mathcal{Y}$ be the two label sets, $n_{ij}$ the number of samples assigned to cluster $i$ in $\mathcal{X}$ and cluster $j$ in $\mathcal{Y}$, $n_{i.}$ and $n_{.j}$ the corresponding marginal sums, and $n$ the total number of samples. The ARI is then defined as \begin{align*} \text{ARI}(\mathcal{X},\mathcal{Y}) = \frac{\sum_{i}\sum_{j}\binom{n_{ij}}{2} - \left[\sum_{i}\binom{n_{i.}}{2}\sum_{j}\binom{n_{.j}}{2}\right]\Big/\binom{n}{2}}{\frac{1}{2}\left[\sum_{i}\binom{n_{i.}}{2} + \sum_{j}\binom{n_{.j}}{2}\right] - \left[\sum_{i}\binom{n_{i.}}{2}\sum_{j}\binom{n_{.j}}{2}\right]\Big/\binom{n}{2}}.\tag{5}\end{align*}
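Eq. (5) can be computed directly from the contingency table between the two label assignments, as in the following sketch:

```python
import numpy as np

def comb2(n):
    """Number of unordered pairs, C(n, 2)."""
    return n * (n - 1) / 2.0

def adjusted_rand_index(x, y):
    """Adjusted Rand index of Eq. (5), computed from the contingency
    table n_ij between label assignments x and y."""
    xs, ys = sorted(set(x)), sorted(set(y))
    n = len(x)
    nij = np.array([[sum(1 for a, b in zip(x, y) if a == u and b == v)
                     for v in ys] for u in xs])
    sum_ij = sum(comb2(v) for v in nij.ravel())       # sum of C(n_ij, 2)
    sum_i = sum(comb2(v) for v in nij.sum(axis=1))    # sum of C(n_i., 2)
    sum_j = sum(comb2(v) for v in nij.sum(axis=0))    # sum of C(n_.j, 2)
    expected = sum_i * sum_j / comb2(n)
    max_index = 0.5 * (sum_i + sum_j)
    return (sum_ij - expected) / (max_index - expected)
```

Because the ARI compares partitions rather than label values, permuting the cluster labels leaves the score unchanged.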
C. Evaluation of the Number of User Feedback
Firstly, we evaluated the performance of the system when a user uses it for a year (365 iterations).
We compared the proposed method with a method that does not update the feature space (without adaptation). The method (without adaptation) only modifies wrong labels by the user feedback and never updates the feature space.
Table 1 shows the average number of user feedback in a year (365 iterations).
D. Clustering Performance Against the Number of User Feedback
We evaluated the clustering performance when the maximum number of user feedback is limited. We limited the maximum number of user feedback to several fixed values.
We compared the performance of the proposed method to the situation with no user adaptation. As a comparative method, we used a method (without adaptation) that modifies the clustering label but does not update the feature extractor; that is, it does not adapt to the user. We also compared with a method that does not use any user feedback (no user feedback).
By comparing to the method with only label modification (without adaptation), we confirmed that the proposed interactive user adaptation achieved higher accuracy.
To investigate the effects of user feedback, we show the transition of the ARI scores. The results are summarized in Table 2, and visualized in Fig. 11.
Transition of the adjusted Rand index score for 365 iterations: the difference between these lines is the maximum number of user feedback.
From Fig. 11, we confirm that more user feedback yields more accurate results. At the beginning, the ARI score is noisy because of the small number of samples. When the feature extractor is not updated, the ARI score drops rapidly once the number of feedback instances reaches the limit. In addition, the ARI score decreases when the number of user feedback is small; however, it decreases slowly and remains high until the end of the iterations. When the limit on the number of user feedback is 100, the proposed method does not reach the limit even after 365 iterations.
E. Clustering Performance on Known Data
We evaluated the clustering performance after the system had adapted to a user, where all the clothing instances of the user are known. First, we ran the system for 365 iterations with user feedback (without limitation on the maximum number of user feedback). The system adapted suitably to the user. Then, we additionally ran the system for 365 iterations without user feedback and evaluated the clustering performance at the last iteration. Here, in the additional 365 iterations, it is assumed that all the clothing is known; that is, there is no new clothing instance in the 365 iterations.
To evaluate the effectiveness of the user adaptation, we compared two methods whose feature extractors were either updated or not updated during the former 365 iterations. We evaluated the performance using the ARI score.
The results are shown in Fig. 12: the system with user adaptation (updating the feature extractor) exhibited high performance in the latter 365 iterations.
Transition of the adjusted Rand index score for the latter 365 iterations on known data.
F. Visualization of the Feature Spaces
To investigate the effect of the user adaptation method, we visualized the generic and user-specific clothing feature spaces. After 365 iterations of clustering with and without user adaptation, we visualized the features by applying t-distributed Stochastic Neighbor Embedding (t-SNE) [39] to obtain two-dimensional embeddings. In Fig. 13, different pieces of clothing are indicated in different colors. With the proposed method (Fig. 13 (i)), the overlap of feature points is reduced, and the features of each piece of clothing are gathered more closely while the distance between clusters is larger. This makes clustering easier and more accurate. From these observations, we confirmed that the proposed method effectively adapts the feature space to the user's clothing.
t-SNE visualization results of clothing feature spaces of a user with and without user adaptation after 365 iterations: different colors indicate different pieces of clothing.
G. Clustering Performance When the User Feedback is Not Always Correct
Users are prone to making mistakes such as forgetting to provide user feedback or providing wrong feedback. In such cases, mislabeled samples may be included in the clustering results.
To evaluate the robustness against mislabeled samples, we evaluated the clustering performance by changing the ratio of correct user feedback among 0.25, 0.50, 0.75, and 1.00.
The ARI scores after 365 iterations are shown in Table 3. Even when the user feedback included wrong labels, the system worked well. Even when 50% of the user feedback was incorrect, the system maintained an ARI score of about 0.7. Therefore, even if the user makes several mistakes in the feedback, they do not significantly affect performance.
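The imperfect-feedback protocol can be simulated as in the following sketch; the uniform error model is an illustrative assumption, not necessarily the paper's exact protocol.

```python
import random

def simulate_feedback(true_label, num_clusters, correct_ratio, rng):
    """Simulate imperfect user feedback: with probability `correct_ratio`
    the user returns the correct label, and otherwise a uniformly random
    (possibly wrong) cluster label (an assumed error model)."""
    if rng.random() < correct_ratio:
        return true_label
    return rng.randrange(num_clusters)
```

Running the clustering loop with `correct_ratio` set to 0.25, 0.50, 0.75, and 1.00 reproduces the four evaluation settings of Table 3.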
Conclusion
In this paper, we proposed a system that can automatically aggregate clothing and visualize the frequency at which a person wears them by only observing the person using a monitoring camera at home. Since there is a gap between the clothing of a person and those available in the world, it is not suitable to cluster the clothing of the user using the generic clothing features extracted by a feature extractor trained using a large number of clothing photos. To fill the gap, we proposed an interactive domain adaptation using user feedback.
Through the evaluation of a real-world dataset, we confirmed that the proposed interactive domain adaptation achieved higher clustering performance and reduced the amount of user feedback.
Creating a sophisticated user interface to make the user feedback procedure easier is left for future work. In this research, we assumed that each user's clothing set does not change during the year, ignoring seasonal trends in clothing. In practice, users may dispose of clothing and buy new items; we need to consider such changes in the dataset. Since image generation techniques have been actively developed recently, such methods could be used to increase the number of clothing items and people in the dataset. Evaluation on such an enhanced dataset is also left for future work.