DeepStyle: Multimodal Search Engine for Fashion and Interior Design

In this paper, we propose a multimodal search engine that combines visual and textual cues to retrieve items from a multimedia database aesthetically similar to the query. The goal of our engine is to enable intuitive retrieval of fashion merchandise such as clothes or furniture. Existing search engines treat textual input only as an additional source of information about the query image and do not correspond to the real-life scenario where the user looks for 'the same shirt but of denim'. Our novel method, dubbed DeepStyle, mitigates those shortcomings by using a joint neural network architecture to model contextual dependencies between features of different modalities. We prove the robustness of this approach on two different challenging datasets of fashion items and furniture where our DeepStyle engine outperforms baseline methods by 18-21% on the tested datasets. Our search engine is commercially deployed and available through a Web-based application.


I. INTRODUCTION
A multimodal search engine retrieves a set of items from a multimedia database according to their similarity to the query in more than one feature space, e.g. textual and visual or audiovisual. This problem can be divided into smaller subproblems by using separate solutions for each modality. The advantage of this approach is that both textual and visual search engines have been developed for several decades now and have reached a certain level of maturity. Traditional approaches such as Video Google [2] have been improved, adapted and deployed in industry, especially in the ever-growing domain of e-commerce. Major online retailers such as Zalando and ASOS already offer visual search engine functionalities to help users find products that they want to buy [3]. Furthermore, interactive multimedia search engines are omnipresent in mobile devices and allow for speech, text or visual queries [4], [5], [6].
Nevertheless, using a separate search engine for each modality suffers from one significant shortcoming: it prevents users from specifying a very natural query such as 'I want this type of dress but made of silk'. This is mainly because the notion of similarity in the separate spaces of different modalities differs from that in one multimodal space. Furthermore, modeling this high-dimensional multimodal space requires more complex training strategies and thoroughly annotated datasets. Finally, the right balance between the importance of the various modalities in the context of a user query is not obvious and is hard to estimate a priori. Although several multimodal representations have been proposed in the context of a search for fashion items, they typically focus on using other modalities as an additional source of information, e.g. to increase classification accuracy of compatible and non-compatible outfits [7].

Fig. 1. Example of a typical multimodal query sent to a search engine for fashion items. By modeling common multimodal space with a deep neural network, we can provide a more flexible and natural user interface while retrieving results that are semantically correct, as opposed to the results of the search based on the state-of-the-art Visual-Semantic Embedding model [9].
To address the above-mentioned shortcomings of the currently available search engines, we propose a novel end-to-end method that uses a neural network architecture to model the joint multimodal space of database objects. This method is an extension of our previous work [8] that blended multimodal results. Although in this paper we focus mostly on fashion items such as clothes, accessories and furniture, our search engine is in principle agnostic to object types and can be applied in many other multimedia applications. We call our method DeepStyle and show that, thanks to its ability to jointly model both visual and textual modalities, it allows for more intuitive search queries while providing higher accuracy than the competing approaches. We show the superiority of our method over single-modality approaches and a state-of-the-art multimodal representation using two large-scale datasets of fashion and furniture items. Finally, we deploy our DeepStyle search engine as a web-based application.
To summarize, the contributions of our paper are threefold:
• In addition to the results using blending methods from multiple modalities, we propose a novel multimodal end-to-end search engine based on a deep neural network architecture. It is robust to domain changes and outperforms the state-of-the-art methods on two diversified datasets by 18 and 21% respectively.
• Our system is deployed in production and available through a Web-based application.
• Last but not least, we introduce a new interior design dataset of furniture items offered by IKEA, an international furniture manufacturer, which contains both visual and textual meta-data of over 2 000 objects from almost 300 rooms. We plan to release the dataset to the public.

The remainder of this work is organized in the following manner. In Sec. II we discuss related work. In Sec. III we present a set of methods based on blending single-modality search results that serve as our baseline. In Sec. IV we introduce our DeepStyle multimodal approach as well as its extension. In Sec. V we present the datasets used for evaluation and in Sec. VI we evaluate our method and compare its results against the baseline. Sec. VII describes our Web application and Sec. VIII concludes the paper.

arXiv:1801.03002v1 [cs.CV] 8 Jan 2018

II. RELATED WORK
In this section, we first give an overview of the current visual search solutions proposed in the literature. Secondly, we discuss several approaches used in the context of textual search. We then present works related to defining similarity in the context of aesthetics and style, as it directly pertains to the results obtained using our proposed method. Finally, we present an overview of existing search methods in the fashion domain, as this topic is gaining popularity.

A. Visual Search
Traditionally, image-based search methods drew their inspiration from textual retrieval systems [10]. By applying k-means clustering in the space of local feature descriptors such as SIFT [11], they are able to mimic textual word entities with the so-called visual words. Once the mapping from salient image keypoints to visually representative words was established, typical textual retrieval methods such as Bag-of-Words [12] could be used. Video Google [2] was one of the first visual search engines that relied on this concept. Several extensions of this concept were proposed, e.g. spatial verification [13] that checks for geometrical correctness of the initial query or fine-grained image search [14] that accounts for semantic attributes of visual words.
Successful applications of deep learning techniques in other computer vision applications have motivated researchers to apply those methods also to visual search. Although preliminary results did not seem promising due to the lack of robustness to cropping, scaling and image clutter [15], later works proved the potential of those methods in the domain of image-based retrieval [16]. Many other deep architectures such as Siamese networks were also proposed, and proved successful when applied to content-based image retrieval [17].
Nevertheless, all of the above-mentioned methods suffer from one important drawback, namely they do not take into account the stylistic similarity of the retrieved objects, which is often a different problem from visual similarity. Items that are similar in style do not necessarily have to be close in visual features space.

B. Textual Search
First methods that proposed to address textual information retrieval were based on token counts, e.g. Bag-of-Words [12] or TF-IDF [18].
Later, a new type of representation called word2vec was proposed by Mikolov et al. [19]. The models in the word2vec family, namely continuous Bag of Words (CBOW) and Skip-Grams, allow the token representation to be learned based on its local context. To also capture the global context of the token, GloVe [20] was introduced. GloVe takes advantage of information both from the local context and the global co-occurrence matrix, thus providing a powerful and discriminative representation of textual data. Similarly to visual search, textual search has its limitations: not all queries can be represented with text only. A clear textual definition might be missing for style similarities that are apparent in visual examples. Also, the same concepts might be expressed in synonymous ways.

C. Stylistic Similarity
Comparing the style similarity of two objects or scenes is one of the challenges that have to be addressed when training a machine learning model for an interior design or fashion retrieval application. This problem is far from being solved, mainly due to the lack of a clear metric defining how to measure style similarity. Various approaches have been proposed for defining a style similarity metric. Some of them focus on evaluating similarity between shapes based on their structures [21], [22] and measuring the differences between scales and orientations of bounding boxes. Other approaches propose a structure-transcending style similarity that accounts for element similarity [23]. In this work, we follow [24], and define style as a distinctive manner which permits the grouping of works into related categories. We enforce this definition by including context information that groups different objects together (clothing items in an outfit, or furniture in a room picture from an interior design catalog). This allows us to take a data-driven approach that measures style similarity without using hand-crafted features and predefined styles.

D. Deep Learning in Fashion
A significant number of works have been published in the domain of fashion item retrieval or recommendation due to the potential of their application in the highly profitable e-commerce business. Some of them focused on the notion of fashionability, e.g. [26] rated a user's photo in terms of how fashionable it is and provided fashion recommendations that would increase the overall outfit score. Others focused on fashion item retrieval from an online database when presented with user photos taken 'in the wild', usually with phone cameras [27]. Finally, there is ongoing research on clothing co-segmentation [28], [29], which is an important preprocessing step for better item retrieval results.
Kiros et al. [9] present an encoder-decoder pipeline that learns a joint multimodal embedding (VSE) from images and text, which is later used to generate text captions for custom images. Their approach is inspired by successes in Neural Machine Translation (NMT) and perceives visual and textual modalities as the same concept described in different languages. The proposed architecture consists of LSTM RNNs for encoding sentences, a CNN for encoding images and a structure-content neural language model (SC-NLM) for decoding. The authors show that their learned multimodal embedding space preserves semantic regularities in terms of vector space arithmetic, e.g. image of a blue car - "blue" + "red" is near images of red cars. However, results of this task are only available in some example images. We would like to leverage their work and numerically evaluate multimodal query retrieval, specifically in the domain of fashion and interior design.
Xintong Han et al. [30] train a bi-LSTM model to predict the next item in outfit generation. Moreover, they learn a joint image-text embedding by regressing image features to their semantic representations, aiming to inject attribute and category information as a regularization for training the LSTM. It should be noted, however, that their approach to stylistic compatibility differs from ours in that they optimize for generation of a complete outfit (e.g. it should not contain two pairs of shoes), whereas we would like to retrieve items of similar style regardless of the category they belong to. Also, they evaluate compatibility with a "fill-in-the-blanks" test that does not incorporate retrieval from the full dataset of items.
Only several example results are illustrated and no quantitative evaluation is presented.
Numerous works focus on the task of generating a compatible outfit from available clothing products [7], [30]. However, none of the related works focus on the notion of multimodality and multimodal fashion retrieval. Text information is only used as an alternative query and not as complementary information that extends the information about the searched object. Finally, the research community has not yet paid much attention to defining or evaluating style similarity.

III. FROM SINGLE TO MULTIMODAL SEARCH
In this section, we present a baseline style search engine model introduced in [8], which is the basis for our current research. It is built on top of two single-modal modules. More precisely, two searches are run independently for the image and text queries, resulting in two initial sets of results. Then, the best matches are selected from the initial pool of results according to blending methods: re-ranking based on visual feature similarity to the query image as well as on contextual similarity (items that appear more often together in the same context).
For input, the baseline style search engine takes two types of query information: an image containing one or more objects, e.g. a picture of a dining room, and a textual query used to specify search criteria, e.g. cozy and fluffy. If needed, an object detection algorithm is run on the uploaded picture to detect objects of classes of interest such as chairs, tables or sofas. Once the objects are detected, their regions of interest are extracted as picture patches and run through the visual search method. For queries that already represent a single object, no object detection is required. Simultaneously, the engine retrieves the results for the textual query. With all visual and textual matches retrieved, our blending algorithm ranks them depending on the similarity in the respective feature spaces and returns the resulting list of stylistically and aesthetically similar objects. Fig. 2 shows a high-level overview of our Style Search Engine. Below, we describe each part of the engine in more detail.

A. Visual Search
Instead of using an entire image of the interior as a query, our search engine applies an object detection algorithm as a pre-processing step. This way, not only can we retrieve the results with higher precision, as we search only within a limited space of same-class pictures, but we do not need to know the object category beforehand. This is in contrast to other visual search engines proposed in the literature [17], [31], where the object category is known at test time or inferred from textual tags provided by human labeling.
For object detection, we used YOLO 9000 [25], which is based on the DarkNet-19 model [32], [25]. The bounding boxes are then used to generate regions of interest in the pictures and search is performed on the extracted parts of the image.
Once the regions of interest are extracted, we feed them to a pretrained deep neural network to get a vector representation. More precisely, we use the outputs of fully connected layers of neural networks pretrained on the ImageNet dataset [33]. We then normalize the extracted output vectors, so that their L2 norm is equal to 1. We search for similar images within the dataset using this representation to retrieve a number of closest vectors (in terms of Euclidean distance).
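The normalization and nearest-neighbour lookup described above can be illustrated with a minimal pure-Python sketch. The function names, the toy three-dimensional vectors and the brute-force scan are assumptions made for illustration only; they stand in for real network descriptors and are not the deployed implementation.

```python
import math

def l2_normalize(vec):
    """Scale a feature vector so that its L2 norm equals 1."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)  # leave an all-zero vector unchanged
    return [x / norm for x in vec]

def euclidean(a, b):
    """Euclidean distance between two equally sized vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_items(query, database, k):
    """Return the ids of the k database vectors closest to the query."""
    q = l2_normalize(query)
    scored = [(euclidean(q, l2_normalize(vec)), item_id)
              for item_id, vec in database.items()]
    scored.sort()
    return [item_id for _, item_id in scored[:k]]

# Hypothetical toy descriptors; real CNN features have far more dimensions.
database = {
    "chair_a": [1.0, 0.1, 0.0],
    "chair_b": [0.9, 0.2, 0.1],
    "lamp": [0.0, 0.1, 1.0],
}
result = nearest_items([2.0, 0.2, 0.0], database, k=2)
```

Because all vectors have unit norm, ranking by Euclidean distance yields the same order as ranking by cosine similarity.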
To determine the pretrained neural network architecture providing the best performance, we conduct several experiments that are illustrated in Fig. 3. As a result, we choose ResNet-50 as our visual feature extraction architecture.

B. Text Query Search
To extend the functionality of our Style Search Engine, we implement a text query search that allows the user to further specify the search criteria. This part of our engine is particularly useful when searching for product items that represent abstract concepts such as minimalism, Scandinavian style, casual and so on.
In order to perform such a search, we need to find a mapping from textual information to a vector representation of the item, i.e., from the space of textual queries to the space of items in the database. The resulting representation should live in a multidimensional space, where stylistically similar objects reside close to each other. To obtain the above-defined space embedding, we use a Continuous Bag-of-Words (CBOW) model that belongs to the word2vec model family [19]. In order to train our model, we use the descriptions of items available as metadata supplied with the catalog images. Such descriptions are available as part of both the IKEA and the Polyvore datasets, which we describe in detail in Sec. V. The textual description embedding is calculated as the mean vector of the individual word embeddings.
In order to optimize the hyper-parameters of CBOW for item embedding, we run a set of initial experiments on the validation dataset and use cluster analysis of the embedding results. We select the parameters that minimize intra-cluster distances while at the same time maximizing inter-cluster distances.
Having found such a mapping, we can perform the search by returning k-nearest neighbors of the transformed query in the space of product descriptions from the database using cosine similarity as a distance measure.
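As an illustration of the text search step, the sketch below averages word vectors into a description embedding and ranks items by cosine similarity to the query. The tiny two-dimensional `word_vectors` table is a hypothetical stand-in for a trained CBOW model, and the whitespace tokenization and the skipping of unknown words are simplifying assumptions.

```python
import math

def mean_embedding(tokens, word_vectors):
    """Average the vectors of known tokens; unknown tokens are skipped."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[d] for vec in known) / len(known) for d in range(dim)]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def text_search(query, descriptions, word_vectors, k):
    """Return the ids of the k items whose descriptions best match the query."""
    q = mean_embedding(query.lower().split(), word_vectors)
    if q is None:
        return []
    scored = []
    for item_id, text in descriptions.items():
        emb = mean_embedding(text.lower().split(), word_vectors)
        if emb is not None:
            scored.append((cosine_similarity(q, emb), item_id))
    scored.sort(reverse=True)  # highest similarity first
    return [item_id for _, item_id in scored[:k]]

# Hypothetical embeddings standing in for a trained word2vec model.
word_vectors = {
    "cozy": [0.9, 0.1],
    "fluffy": [0.8, 0.3],
    "rug": [0.7, 0.2],
    "steel": [0.1, 0.9],
    "shelf": [0.2, 0.8],
}
descriptions = {"rug_1": "fluffy rug", "shelf_1": "steel shelf"}
top = text_search("cozy fluffy", descriptions, word_vectors, k=1)
```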

C. Context Space Search
In order to leverage the information about the compatibility of different items, which is available as context data (outfit or room), we train an additional word2vec model (using the CBOW model), where different products are treated as words. Compatible sets of those products appearing in the same context are treated as sentences. It is worth noting that our context embedding is trained without relying on any linguistic knowledge. The only information that the model sees during training is whether given objects appeared in the same set. Fig. 4 shows the obtained feature embeddings using the t-SNE dimensionality reduction algorithm [34] for the IKEA dataset. One can see that some classes of objects, e.g. those that appear in a bathroom or a baby room, are clustered around the same region of the space.
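To make the "products as words, contexts as sentences" idea concrete, the sketch below builds the (context, target) examples that a CBOW-style model would be trained on. The helper name `cbow_examples`, the item identifiers and the window size are hypothetical. Since a room or outfit is an unordered set, the item order inside each list is arbitrary, and a window at least as large as the set makes every other item part of the context.

```python
def cbow_examples(contexts, window):
    """Build (context_items, target_item) pairs, one 'sentence' per set."""
    examples = []
    for items in contexts:
        for pos, target in enumerate(items):
            start = max(0, pos - window)
            # Neighbours on both sides of the target, never crossing sets.
            neighbours = items[start:pos] + items[pos + 1:pos + 1 + window]
            if neighbours:
                examples.append((neighbours, target))
    return examples

# Hypothetical rooms; each list plays the role of one sentence.
rooms = [
    ["crib", "changing_table", "night_light"],
    ["bath_mat", "towel_rail"],
]
examples = cbow_examples(rooms, window=2)
```

No linguistic information enters these examples: the only signal is which items share a set, which matches the training signal described above.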

D. Blending Methods
Let us denote by $p = (i, t)$ the representation of a product stored in the database $P$. This representation consists of a catalog image $i \in I$ and the textual description $t \in T$. The multimodal query provided by the user is given by $Q = (i_q, t_q)$, where $i_q \in I$ is the visual query and $t_q \in T$ is the textual query.
We run a series of experiments with blending methods, aiming to combine the retrieval results from various modalities in the most effective way. To that end, we use the following approaches for blending.
Late-fusion Blending: In the simplest case, we retrieve the top k items independently for each modality and take the union of the two retrieved sets as the final results. We do not use the contextual information here.
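A minimal sketch of this late-fusion step is given below, assuming that the two per-modality top-k lists are merged as a set union with duplicates removed and that visual matches are listed first. Both the function name and the ordering are illustrative assumptions.

```python
def late_fusion(visual_ranked, textual_ranked, k):
    """Merge the top-k visual and top-k textual results without duplicates."""
    merged = []
    for item_id in visual_ranked[:k] + textual_ranked[:k]:
        if item_id not in merged:
            merged.append(item_id)
    return merged

# Hypothetical ranked lists returned by the two single-modality engines.
fused = late_fusion(["a", "b", "c"], ["b", "d", "e"], k=2)
```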
Early-fusion Blending: In order to use the full potential of our multimodal search engine, we combine the retrieval results of the visual, textual and contextual search engines in a specific order. We optimize this order to present the most stylistically coherent sets to the user. To that end, we propose an Early-fusion Blending approach that uses features extracted from different modalities in a sequential manner.
More precisely, for a multimodal query $(i_q, t_q)$, an initial set of results $R_{vis}$ is returned for the visual modality: the closest images to $i_q$ in terms of the Euclidean distance $d_{vis}$ between their visual representations. Then, we retrieve contextually similar products $R_{cont}$ that are close to the $R_{vis}$ results in terms of the $d_{cont}$ distance (context space search described in Sec. III-C). Finally, $R_{vis}$ and $R_{cont}$ form a list of candidate items from which we select the results $R$ by extracting the textual features (word2vec vectors) from the item descriptions and ranking them by their distance $d_{text}$ from the textual query.
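This process can be formulated as:

$$R_{vis} = \operatorname*{arg\,min}^{(n_1)}_{p \in P} \; d_{vis}(i_q, i_p),$$

$$R_{cont} = \bigcup_{p' \in R_{vis}} \operatorname*{arg\,min}^{(n_2)}_{p \in P} \; d_{cont}(p', p),$$

$$R = \operatorname*{arg\,min}^{(n_3)}_{p \in R_{vis} \cup R_{cont}} \; d_{text}(t_q, t_p),$$

where $\operatorname{arg\,min}^{(n)}$ denotes the $n$ items with the smallest distance, $i_p$ and $t_p$ are the image and the textual description of product $p$, and $n_1$, $n_2$ and $n_3$ are parameters to be chosen empirically.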
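The three sequential stages can be sketched as follows. This is an illustrative reading of the procedure, not the authors' code: it assumes that n1, n2 and n3 are the numbers of items kept at each stage, uses Euclidean distance in all three spaces for brevity, and relies on hypothetical precomputed 'vis', 'ctx' and 'txt' vectors for each item.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def top_n(candidates, distance, n):
    """Return the n candidates with the smallest distance value."""
    return sorted(candidates, key=distance)[:n]

def early_fusion(query_img, query_txt, items, n1, n2, n3):
    """items maps an id to a dict with 'vis', 'ctx' and 'txt' vectors."""
    ids = list(items)
    # Stage 1: items visually closest to the query image.
    r_vis = top_n(ids, lambda p: euclidean(query_img, items[p]["vis"]), n1)
    # Stage 2: for each visual match, its nearest items in context space.
    r_cont = []
    for seed in r_vis:
        others = [p for p in ids if p != seed]
        close = top_n(
            others,
            lambda p: euclidean(items[seed]["ctx"], items[p]["ctx"]),
            n2,
        )
        r_cont.extend(p for p in close if p not in r_cont)
    # Stage 3: rank the pooled candidates by distance to the textual query.
    pool = r_vis + [p for p in r_cont if p not in r_vis]
    return top_n(pool, lambda p: euclidean(query_txt, items[p]["txt"]), n3)

# Hypothetical two-dimensional features for four catalog items.
items = {
    "sofa": {"vis": [1.0, 0.0], "ctx": [0.0, 1.0], "txt": [1.0, 0.0]},
    "armchair": {"vis": [0.9, 0.1], "ctx": [0.1, 0.9], "txt": [0.0, 1.0]},
    "cushion": {"vis": [0.0, 1.0], "ctx": [0.0, 0.9], "txt": [0.1, 0.9]},
    "desk": {"vis": [0.5, 0.5], "ctx": [1.0, 0.0], "txt": [0.9, 0.1]},
}
ranked = early_fusion([1.0, 0.0], [0.0, 1.0], items, n1=1, n2=1, n3=2)
```

In this toy run the armchair is textually closest to the query but is never returned, because it passes neither the visual nor the contextual stage; the text query only re-ranks the pooled candidates.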

IV. DEEPSTYLE: MULTIMODAL STYLE SEARCH ENGINE WITH DEEP LEARNING
Inspired by recent advancements in deep learning for computer vision, we experiment with end-to-end approaches that learn the embedding space jointly. In this section, we propose neural network architectures that are fed with image and text as inputs, while learning a multimodal embedding space. Such an embedding can later be used to retrieve results using a multimodal query. The first proposed architecture is a multimodal DeepStyle network that learns a common image-text embedding through a classification task. The second, the DeepStyle-Siamese network, improves over the first network by introducing a second branch with shared weights and a contrastive loss, so that pairs from the same outfit are mapped close together in the embedding space.
DeepStyle: Our proposed neural network learns a common embedding through a classification task. Our architecture, dubbed DeepStyle, is inspired by [7], which uses a multimodal joint embedding for fashion product retrieval. In contrast to that work, our goal is not to retrieve images with a text query (or vice versa) but to retrieve items where a text query complements the image and provides additional query requirements.
Similarly to [7], our network has two inputs: image features (the output of the penultimate layer of a pretrained CNN) and text features (processed with the same word2vec model trained on descriptions). We then optimize for classification loss to enforce the concept of semantic regularities. For this purpose, product category labels (with an arbitrary number of classes) should be present in the dataset. Unlike [7], we do not consider the image and the text branches separately for predictions but add a fully connected layer on top of the concatenated image and text embeddings that is used to predict a single class.
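The joint classification head can be sketched in a few lines of plain Python. This is a deliberately reduced illustration under stated assumptions: the per-modality layers that would precede the concatenation are omitted, the weights are hand-picked rather than learned, and all names and dimensions are hypothetical.

```python
import math

def dense(x, weights, bias):
    """Fully connected layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def deepstyle_forward(image_feat, text_feat, weights, bias):
    """Concatenate both modalities and predict a single class distribution."""
    joint = list(image_feat) + list(text_feat)  # joint image-text embedding
    return joint, softmax(dense(joint, weights, bias))

def cross_entropy(probs, label):
    """Classification loss for one example with an integer class label."""
    return -math.log(probs[label])

# Hypothetical 2-d image and text features and a 2-class layer.
weights = [[1.0, 0.0, 0.0, 1.0],
           [0.0, 1.0, 1.0, 0.0]]
bias = [0.0, 0.0]
joint, probs = deepstyle_forward([1.0, 0.0], [0.0, 1.0], weights, bias)
loss = cross_entropy(probs, label=0)
```

At query time the joint vector, rather than the class prediction, would serve as the multimodal descriptor used for nearest-neighbour retrieval.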
DeepStyle-Siamese: We also want to include context information (whether or not two items appeared in the same context) in our network. For this purpose, we design a Siamese network [35] where each branch has a dual input consisting of image and text features. Positive pairs are generated as image-text pairs from the same outfit, while unrelated pairs are obtained by randomly sampling an item (image and description) from a different outfit.
Two types of losses are optimized. Classification loss is used as before to help the network learn semantic regularities. In addition, minimizing the contrastive loss encourages image-text pairs from the same outfit to have a small distance between their embedding vectors, while items from different outfits are pushed to a distance larger than a predefined margin.
Formally, contrastive loss is defined in the following manner [35]:

$$L_C(d, y) = (1 - y)\,\frac{1}{2}\,d^2 + y\,\frac{1}{2}\,\max(0,\, m - d)^2,$$

where $d$ is the Euclidean distance between two different embedded image-text vectors $(i, t)$ and $(i', t')$, $y$ is a binary label indicating whether two vectors are from the same outfit ($y = 0$) or from different outfits ($y = 1$) and $m$ is a predefined margin for the minimal distance between items from different outfits. Full training loss consists of a weighted sum of the contrastive loss and the cross entropy classification losses:

$$L(i, t, i', t') = \alpha\, L_C(d, y) + \beta\, L_X\big(Cl_1(i, t), \tilde{y}(i, t)\big) + \gamma\, L_X\big(Cl_2(i', t'), \tilde{y}(i', t')\big),$$

where $L_X$ is the cross entropy loss, $Cl_1(i, t)$ and $Cl_2(i', t')$ are the outputs of the first and second classification branches respectively and $\tilde{y}(i, t)$ is the category label for the product with image $i$ and text description $t$. Parameters $\alpha$, $\beta$, $\gamma$ are treated as hyperparameters for tuning.

[...] Fig. 5 that has shared weights between the image-text pairs. Three kinds of losses are optimised: the classification loss for each image-text branch and the contrastive loss for image-text pairs. Contrastive loss is computed on joint image and text descriptors.
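The contrastive term described in this section can be sketched in Python using the standard formulation of [35], with y = 0 for a pair from the same outfit and y = 1 for a pair from different outfits. The function names and the example embeddings are hypothetical, and the assignment of the weights alpha, beta and gamma to the individual terms in `total_loss` is an assumption made for illustration.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def contrastive_loss(emb_a, emb_b, y, margin):
    """y = 0 for a pair from the same outfit, y = 1 for different outfits."""
    d = euclidean(emb_a, emb_b)
    pull = 0.5 * d ** 2                     # same outfit: shrink the distance
    push = 0.5 * max(0.0, margin - d) ** 2  # different outfits: enforce margin
    return (1 - y) * pull + y * push

def total_loss(l_contrastive, l_class_1, l_class_2, alpha, beta, gamma):
    """Weighted sum of the contrastive and the two classification losses."""
    return alpha * l_contrastive + beta * l_class_1 + gamma * l_class_2

# Two joint embeddings at Euclidean distance 5.
same = contrastive_loss([0.0, 0.0], [3.0, 4.0], y=0, margin=1.0)
far_apart = contrastive_loss([0.0, 0.0], [3.0, 4.0], y=1, margin=1.0)
too_close = contrastive_loss([0.0, 0.0], [3.0, 4.0], y=1, margin=7.0)
combined = total_loss(too_close, 1.0, 1.0, alpha=0.5, beta=1.0, gamma=1.0)
```

A pair from different outfits contributes nothing once its distance exceeds the margin, so training effort concentrates on unrelated pairs that are still too close.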

V. DATASETS
Although several datasets for standard visual search methods exist, e.g. Oxford 5K [13] or Paris 6K [36], they are not suitable for our experiments, as our multimodal approach requires an additional type of information to be evaluated. More precisely, a dataset that can be used with a multimodal search engine should fulfill the following conditions:
• It should contain both images of individual objects and scene images (room/outfit image) with those objects present.
• It should have a ground truth defining which objects are present in each scene photo.
• It should also have textual descriptions.

We specifically focus on datasets containing pictures of interior design and fashion, as both domains are highly dependent on style and would benefit from style search engine applications. In addition, we analyze datasets with varying degrees of context information, as in real-life applications it might differ from dataset to dataset. For example, in some cases (specifically when the database is not very extensive), items can co-occur very often (in the context of the same design, look or outfit). In other cases, when the database of available items is much bigger, the majority of items will not have many co-occurrences with other items. We apply our Multimodal Search Engine to both types of datasets and perform a quantitative evaluation to find the best model.

A. Interior Design
To our knowledge, there is no publicly available dataset that contains interior design items and fulfills the previously mentioned criteria. Hence, we collect our own dataset by scraping the website of one of the most popular interior design distributors, IKEA. We collect 298 room photos with their descriptions and 2193 individual product photos with their textual descriptions. A sample image of a room scene and an interior item along with their descriptions can be seen in Fig. 7. We also group together products from some of the most frequent object classes (e.g. chair, table, sofa) for more detailed analysis. In addition, we divide room scene photos into 10 categories based on the room class (kitchen, living room, bedroom, children's room, office). The vast majority of furniture items in the dataset (especially from the frequent classes above) have rich context as they appear in more than one room.

B. Fashion
Several datasets for fashion related tasks are already publicly available. DeepFashion [37] contains 800 000 images divided into several subsets for different computer vision tasks. However, it lacks the context (outfit) information as well as detailed text descriptions. The Fashion Icon [28] dataset contains video frames for human parsing but no individual product images. In contrast, the Polyvore [30] dataset satisfies the dataset conditions mentioned before.
The Polyvore dataset contains 111 589 clothing items that are grouped into compatible outfits (of 5-10 items per outfit). We perform additional dataset cleaning by removing non-clothing items such as electronic gadgets, furniture, cosmetics, designer logos and plants. In addition, we perform additional scraping of the Polyvore website (http://polyvore.com) for product items in the cleaned dataset to obtain longer product descriptions and to add descriptions where they are missing. As a result, we have 82 229 items from 85 categories with text descriptions and context information. Context information is much weaker when compared to the IKEA dataset. Only 30% of clothing items appear in more than one outfit.
Item (query) images are already object photos. Therefore, for the fashion dataset, the object detection step of the style search engine is omitted for evaluation.

VI. EVALUATION
In this section we will present the evaluation procedure, as well as the quantitative results.

A. Evaluation Metrics
Similarity score: As mentioned in Sec. II-C, defining a similarity metric that allows quantifying the stylistic similarity between products is a challenging task and an active area of research. In this work, we propose the following similarity measure that is inspired by [24] and based on the probabilistic data-driven approach.
Recall that $P$ is the set of all product items available in the catalog. Let $C$ denote the set of all sets that contain stylistically compatible items (such as outfits or interior design rooms). We then search for a similarity function between two items $p_1, p_2 \in P$ which determines if they fit well together. We propose the empirical similarity function $s_c : P \times P \to [0, 1]$, which is computed in the following way:

$$s_c(p_1, p_2) = \frac{\left|\{C_i \in C : p_1 \in C_i \wedge p_2 \in C_i\}\right|}{\max_{p \in \{p_1, p_2\}} \left|\{C_i \in C : p \in C_i\}\right|}.$$

In fact, it is the number of compatible sets $C_i$ that are empirically found from $C$, in which both $p_1$ and $p_2$ appear, normalized by the maximum number of compatible sets in which any of those items occur. This metric can be interpreted as an empirical probability for the two objects $p_1$ and $p_2$ to appear in the same compatible set and it is expressed by the similarity score lying in the interval $[0, 1]$.

In order to account for datasets that have weak context information (where two items rarely co-occur in the same compatible set), we add an additional similarity measure $s_n$ that is directly derived from their name overlap. It counts the overlap of some of the most frequent descriptive words such as elegant, denim, casual, etc. It should be mentioned, however, that product name information should be independent from the text description (that is used during training). As a result, name-derived similarity is non-zero only on datasets that have this kind of additional name information.
Here, $W_f$ denotes the set of frequent descriptive words appearing in the name of item $f$, and $s_n(p_1, p_2)$ is computed from the overlap between $W_{p_1}$ and $W_{p_2}$. To summarize, an evaluated pair is considered to be similar if either of the two conditions is satisfied:
• the items co-occurred in the same outfit before,
• the names of the two items overlap.

Formally, $s(p_1, p_2) = \max\big(s_c(p_1, p_2), s_n(p_1, p_2)\big)$.
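The co-occurrence score s_c follows directly from its verbal definition: the number of compatible sets containing both items, divided by the larger of the two per-item set counts. The sketch below uses hypothetical item names and a toy collection of rooms; the name-based score s_n is left out because its exact formula is not reproduced here.

```python
def cooccurrence_similarity(p1, p2, compatible_sets):
    """Empirical probability that p1 and p2 share a compatible set."""
    both = sum(1 for s in compatible_sets if p1 in s and p2 in s)
    most = max(sum(1 for s in compatible_sets if p in s) for p in (p1, p2))
    return both / most if most else 0.0

# Hypothetical rooms, each a set of stylistically compatible items.
rooms = [
    {"sofa", "rug"},
    {"sofa", "lamp"},
    {"sofa", "rug", "lamp"},
    {"desk"},
]
sofa_rug = cooccurrence_similarity("sofa", "rug", rooms)
desk_rug = cooccurrence_similarity("desk", "rug", rooms)
```

Here the sofa appears in three rooms and the rug in two, and they share two rooms, so the score is 2/3; items that never co-occur score 0.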
Intra-List similarity: Given that our multimodal query search engine provides a non-ranked list of stylistically similar items, the definition of the evaluation problem differs significantly from other information retrieval domains. For this reason, instead of using some of the usual metrics for performance evaluation like mAP [38] or nDCG [39], which use a ranked list of items as an input, we apply a modified version of the established metric for non-ranked list retrieval. Inspired by [40], we define the average intra-list similarity for a generated results list $R$ of length $k$ to be:

$$AILS(R) = \frac{1}{\binom{k}{2}} \sum_{p_i \in R} \; \sum_{\substack{p_j \in R \\ i < j}} s(p_i, p_j),$$

that is, the average similarity score computed across all possible pairs in the list of generated items. By doing so, we aim to assess the overall compatibility of the generated set. As mentioned in [40], this metric is also permutation-insensitive, hence the order of retrieved results does not matter, making it suitable for unranked results.
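The average intra-list similarity can be sketched as a mean over all unordered pairs of the result list. The pairwise `lookup` function and its scores are hypothetical placeholders for the similarity score s defined in this section.

```python
from itertools import combinations

def average_intra_list_similarity(results, similarity):
    """Mean pairwise similarity over all unordered pairs in the list."""
    pairs = list(combinations(results, 2))
    if not pairs:
        return 0.0  # fewer than two items: no pairs to compare
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)

# Hypothetical pairwise scores; missing pairs count as 0.
pair_scores = {
    frozenset(("sofa", "rug")): 1.0,
    frozenset(("sofa", "lamp")): 0.5,
    frozenset(("rug", "lamp")): 0.0,
}

def lookup(a, b):
    return pair_scores.get(frozenset((a, b)), 0.0)

score = average_intra_list_similarity(["sofa", "rug", "lamp"], lookup)
shuffled = average_intra_list_similarity(["lamp", "sofa", "rug"], lookup)
```

Reordering the list leaves the score unchanged as long as the similarity function is symmetric, which illustrates the permutation insensitivity noted above.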

B. Baseline
In our experiments, we compare the results with a recent multimodal approach to item retrieval, namely Visual-Semantic Embedding (VSE) [9]. For evaluation, we fine-tune the weights of a pretrained model, made publicly available by the authors, on our datasets. The model was pretrained on the MS COCO dataset, which has 80 categories with broad semantic context, hence it is applicable to our datasets. For feature extraction we use the VGG 19 [41] architecture, as suggested by the authors.
We also compare our method with Late and Early-fusion Blending strategies.

C. Results
Evaluation protocol: In order to test the ability of our method to generalize, we evaluate it on data that is not used for training. For both datasets, we set aside 10% of the initial number of items for that purpose. All results shown in this section come from the following evaluation procedure:
1) For each item/text query from the test set we extract visual and textual features.
2) We run the engine and retrieve a set of k most compatible items from the trained embedding space.
3) We evaluate the query results by computing the Average Intra-List Similarity (AILS) metric for all possible pairs between the retrieved items and the query, which gives $\binom{k}{2}$ pairs for k retrieved items.
4) The final results are computed as the mean of the AILS scores for all of the tested queries.

It should be noted that for the IKEA dataset, object detection is performed on room images and similar items are returned for the most confident item in the picture. On the other hand, for the Polyvore dataset, the test set images are already catalog items of clothes on a white background, hence object detection is not necessary and this step is omitted.
Quantitative results: Tab. I shows the results of the blending methods for the IKEA dataset in terms of the mean value of our similarity metric.
When analyzing the results of the blending approaches, we experiment with several textual queries in order to evaluate the robustness of the system to changes in the text search. We observe that the DeepStyle approach outperforms the baseline and the other blending methods for almost all text queries, achieving the highest average similarity score. The DeepStyle-Siamese approach gives the best results, outperforming the VSE baseline by 21% on the IKEA dataset and 18% on the Polyvore dataset. Tab. II shows the results of all of the tested methods for the Polyvore dataset in terms of the mean value of our similarity metric. Here, we also evaluate two joint architectures, namely DeepStyle and DeepStyle-Siamese. Fig. VI-C shows that the DeepStyle architecture yields better results in terms of average performance over different textual queries, when compared to our previous manual blending approaches as well as the VSE baseline approach. In this case, DeepStyle-Siamese also yields the best average similarity results. In terms of average performance, it scores 32% higher than the simplest baseline model and more than 4% higher than DeepStyle.

VII. WEB APPLICATION
To apply our method in a real-life application, we implemented a Web-based application of our Style Search Engine. The application allows the user either to choose the query image from a pre-defined set of room images or to upload their own image. The application was implemented using Python Flask, a lightweight server library. It is currently

Fig. 9. Mean AILS metric scores for selected textual queries and the average of the mean scores for all of the methods. We can see that our DeepStyle-Siamese architecture significantly outperforms other architectures on multiple text queries.

VIII. CONCLUSIONS
In this paper, we experiment with several different architectures for multimodal query item retrieval. These include manual result blending approaches as well as joint systems, where we learn common embeddings using classification and contrastive loss functions. Our method achieves state-of-the-art results for the generation of stylistically compatible item sets using multimodal queries. We also show that our methodology can be applied to various commercial domain applications, easily adapting to new e-commerce datasets by exploiting the product images and their associated metadata. Finally, we deploy a publicly available web implementation of our solution and release the new dataset with the IKEA furniture items.