RecipeBowl: A Cooking Recommender for Ingredients and Recipes Using Set Transformer

Countless possible recipe combinations make it challenging to determine which additional ingredient goes well with others. In this work, we propose RecipeBowl, a cooking recommendation system that takes a set of ingredients and cooking tags as input and suggests possible ingredient and recipe choices. We formulate a recipe completion task to train RecipeBowl on our constructed dataset, where the model predicts a target ingredient previously eliminated from the original recipe. RecipeBowl consists of a set encoder and a 2-way decoder for prediction. For the set encoder, we utilize the Set Transformer, which builds meaningful set representations. Overall, our model builds a set representation of a leave-one-out recipe and maps it to the ingredient and recipe embedding spaces. Experimental results demonstrate the effectiveness of our approach. Furthermore, analyses of model predictions and interpretations reveal interesting insights related to cooking knowledge.


I. INTRODUCTION
Finding the right additional ingredients and sample recipes is an essential yet challenging task in the culinary world due to the vast number of cooking possibilities [1]. Previous works have attempted to build food recommendation systems [2], [3] using small recipe datasets and shallow data-driven approaches. Food pairing tasks [4]-[6] have been proposed, but they were limited to one-to-one ingredient recommendation. With multiple ingredients available, a system that can provide reasonable ingredient and candidate recipe choices based on sophisticated cooking knowledge is desirable.
In this work, we propose RecipeBowl, a set-based model that jointly recommends ingredients and recipes. For example, in Figure 1, given lime, chicken breasts, olive oil and garlic as the input set, the user desires to cook an 'easy', 'main dish' grilled in an 'oven' using 'chicken'. In this case, RecipeBowl suggests ingredients (e.g., balsamic vinegar, cilantro, white wine, rosemary and so on) that are likely to go well with the input set and satisfy the user's needs. Moreover, candidate recipes (e.g., Easy Garlic Chicken, Grilled Pesto Chicken and so on) are also provided to guide the user's decisions on cooking.
The associate editor coordinating the review of this manuscript and approving it for publication was K. C. Santosh.
We formulated a recipe completion task where the model is given a leave-one-out set of ingredients and tag information to predict the one target ingredient previously excluded from the original set. We constructed a dataset based on the large recipe corpus Recipe1M [7]-[9], where each instance consists of a leave-one-out set as input and a target ingredient as output. We then trained the model in a supervised learning setting where it has to predict the target ingredient and its corresponding recipe given the leave-one-out set. The main objective is to simultaneously learn two different embedding spaces and push the model's vector projections towards the actual vector representations in each space. The trained model provides recommendations based on similarity rankings calculated between its predicted ingredient/recipe and the actual ones in each embedding space. We performed quantitative and qualitative analyses of our model's recommendations to demonstrate the viability of our approach. Experimental results show that our model suggests reasonable ingredients that are relevant to the recipe context. Observations of the predicted embedding space in t-SNE visualizations, the set context vectors in clustermaps and the attention weights in heatmaps provide insight into how RecipeBowl utilizes recipe contextual knowledge and derives it from various ingredient combinations.
VOLUME 9, 2021. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
The major contributions are summarized as follows.
• We formulate a recipe completion task that trains a model on set-to-one prediction in a supervised learning setting.
• We propose RecipeBowl, a two-way cooking recommender model that adopts the Set Transformer [10] framework for building representations of ingredient sets.
• We introduce a large-scale recipe completion dataset [8], [9] using Tf-Idf scores for selecting optimal target ingredients.
• Both quantitative and qualitative analysis show that RecipeBowl suggests practical choices based on recipe context and ingredient relations.

II. RELATED WORK
A. LEARNING RECIPE REPRESENTATIONS
Cross-modal features, namely text and image features, have been widely used for generating recipe representations [7], [11], [12]. These methods require image data to conduct recipe-related tasks. Other work trained a neural model on a large recipe corpus [8] to predict food pairing scores and discover novel ingredient pairings [6]. Haussman et al. incorporated semantic-driven knowledge graphs for food recommendation [13]. While the previously mentioned authors utilized either chemical information of ingredients or a large recipe corpus in food pairing related tasks, Park et al. further proposed incorporating both aspects to construct a large-scale ingredient-compound network called FlavorGraph using metapaths [14].
1) RECOMMENDING INGREDIENTS
Prior works on recommending ingredients have also been proposed. Shino et al. used ingredient categories and co-occurrence relations to suggest suitable alternative ingredients for a given recipe [15]. Liu et al. extended this approach by considering the diversity of ingredient categories and the novelty of ingredient combinations [16]. De Clercq et al. used non-negative matrix factorization and shared flavor compound information to retrieve eliminated ingredients from recipes [3].

2) RECOMMENDING RECIPES
Previous works have focused on personalized recipe recommendation using various features and machine learning-based approaches [17]-[20]. Ge et al. proposed incorporating users' tags and ratings that indicate food preferences into recommendation [17]; we employed a similar approach by utilizing recipe tag information such as main dish and 5-minute-cooking. Other works have additionally taken nutrition-related factors into account to provide healthy food recommendations [21]-[24].
Perhaps the previous work closest to our task formulation is that of Cueto et al. [25], who employed memory-based collaborative filtering to recommend ingredients for a given partial recipe. However, the dataset used in their work is small compared to ours, as we trained our deep learning-based model on Recipe1M [8]. Moreover, while Cueto et al.'s model suggests only additional ingredients, our model is trained on both ingredient and recipe representations and provides recommendations for each.

A. PREPROCESSING ORIGINAL DATASET
We built an extended version of the Reciptor [9] dataset, a subset of Recipe1M [7], [8] containing 507,834 recipes. Each recipe instance in our preprocessed dataset contains a list of ingredients, cooking instructions and cooking tags (630 unique tags) that were previously extracted from Recipe1M. Since the rich tag information (e.g., easy, healthy, seasonal [preference]; main-dish, desserts, fruit [cuisine category]; meat, vegetarian, low-calorie [diet information]; american, european, asian [regional category]) from Reciptor would be helpful in our task [17], we crafted a 630-dimensional binary tag vector for each recipe instance. We prepared 3,729 unique ingredients and an 80%/10%/10% randomly partitioned dataset. Prior to dataset construction, we excluded recipes with few (4 or fewer) ingredients from each of the partitions. The resulting dataset has 373,760 training recipes, 47,104 validation recipes and 47,104 test recipes.
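The multi-hot tag encoding described above can be sketched as follows. This is a minimal illustration, not the authors' code; the tag names and vocabulary size are placeholders for the actual 630-tag vocabulary.

```python
# Sketch: building a multi-hot tag vector per recipe. The real vocabulary
# has 630 unique tags; this toy vocabulary is illustrative only.
TAG_VOCAB = ["easy", "healthy", "main-dish", "desserts", "vegetarian", "asian"]
TAG_INDEX = {tag: i for i, tag in enumerate(TAG_VOCAB)}

def tag_vector(recipe_tags):
    """Return a binary vector with a 1 for every tag the recipe carries."""
    vec = [0] * len(TAG_VOCAB)
    for tag in recipe_tags:
        if tag in TAG_INDEX:          # unknown tags are simply ignored
            vec[TAG_INDEX[tag]] = 1
    return vec

print(tag_vector(["easy", "main-dish"]))  # [1, 0, 1, 0, 0, 0]
```

Each recipe instance then carries this fixed-length vector alongside its ingredient set.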

B. SELECTING TARGET INGREDIENTS
We adopted De Clercq et al.'s recipe completion-based approach for training RecipeBowl [3]. The model is trained to predict a target ingredient x given a leave-one-out set X, where x was previously eliminated from an original set X ∪ {x} of ingredients. Based on the above learning objective, we constructed a dataset for recipe completion where each instance includes a leave-one-out ingredient set, a target ingredient and cooking tag information. Our main emphasis is to help the model learn cooking context from the combinatorial nature of various ingredient sets. In De Clercq et al.'s work, the target ingredients were selected randomly [3]. Among randomly selected ingredients, commonly occurring ones such as salt and butter may act as trivial targets. Such targets may render the model unable to differentiate the characteristics of ingredient combinations.
To prevent this, we selected target ingredients based on their Tf-Idf (Term Frequency-Inverse Document Frequency) scores, where terms and documents correspond to ingredients and recipes respectively [26]. The Tf-Idf score indicates the relative importance of an ingredient within a recipe based on its occurrence in the whole corpus. We first calculated the Tf-Idf scores over all ingredients and then normalized them within each recipe; note that the term frequency of an ingredient within a single recipe is always 1. We then selected the ingredient x with the highest Tf-Idf score and eliminated it from each recipe.
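Since the term frequency of an ingredient within a recipe is always 1, the Tf-Idf score effectively reduces to the inverse document frequency over the corpus. A minimal sketch of this target-selection step, with toy recipes and a simple unsmoothed idf (the paper's exact idf variant and normalization may differ):

```python
import math

# Toy corpus: each recipe is a set of ingredients.
recipes = [
    {"salt", "butter", "flour", "yeast", "rosemary"},
    {"salt", "butter", "sugar", "cocoa", "eggs"},
    {"salt", "chicken", "garlic", "lime", "olive oil"},
]

def idf(ingredient, corpus):
    """Inverse document frequency: rarer ingredients score higher."""
    df = sum(1 for r in corpus if ingredient in r)
    return math.log(len(corpus) / df)

def pick_target(recipe, corpus):
    """Remove and return the ingredient with the highest Tf-Idf score."""
    scores = {ing: idf(ing, corpus) for ing in recipe}
    total = sum(scores.values())
    normalized = {ing: s / total for ing, s in scores.items()}  # per-recipe normalization
    target = max(normalized, key=normalized.get)
    return target, recipe - {target}
```

Under this scheme, ubiquitous ingredients such as salt (idf = 0 here) can never be chosen as targets, which matches the motivation above.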
Conclusively, the input for training RecipeBowl on recipe completion is the leave-one-out set X, while the target is x, for each recipe instance X ∪ {x}. We further justify our target selection approach with the following analysis. Figure 2 shows two distributions of target ingredients based on different selection options (Random and Tf-Idf). The distribution based on random selection is skewed: the most frequent targets under random selection are ingredients commonly used in most recipes (e.g., salt, butter and sugar). On the other hand, the distribution based on Tf-Idf selection is relatively uniform, which provides a better learning setting for RecipeBowl.
Along with recommending ingredients, RecipeBowl aims to simultaneously suggest recipe candidates. We utilized the pretrained recipe embeddings from Reciptor [9] as ground truths for training the recipe inference task of our model. Since the pretrained embedding vectors include sequential recipe context, we expect RecipeBowl to suggest acceptable recipe candidates and benefit ingredient recommendation.

A. OVERVIEW
RecipeBowl takes a set of ingredients as input and predicts a corresponding target ingredient and recipe as output (Figure 3). The ingredients, including the target, are represented as continuous vectors retrieved from an embedding lookup table initialized with the ingredient node embeddings from FlavorGraph [14]. The target recipe vectors are pretrained embeddings retrieved from Reciptor [9]. RecipeBowl consists of the Set Encoder and the 2-way Decoder. The Set Encoder encodes a set of ingredient vectors into a set context embedding space. The 2-way Decoder maps the set context vector into two embedding spaces of different modalities. The model is trained in a multitask learning fashion to push each predicted vector towards its target ingredient or recipe vector in the corresponding embedding space.

B. SET ENCODER - LEARNING SET REPRESENTATIONS
We adapt the Set Transformer framework as the Set Encoder module of our model to build latent representations of incomplete ingredient sets using the attention mechanism [10]. In this work, we constructed the Set Transformer as a stack of components comprising an Induced Set Attention Block (ISAB) and a pooling by Multihead Attention (PMA) layer. The ISAB is fed an input set of vectors and calculates self-attention weights between the elements; its final output is also a set of equal size. The PMA layer aggregates the element-wise features by calculating their attention weights against a set of parameterized seed vectors. Both the ISAB and the PMA layer use Multihead Attention Blocks (MAB), which are components of the Transformer model originally proposed by Vaswani et al. [27]. The MAB computes the attention function with multiple projections of the input queries and key-value pairs. Unlike the Set Transformer in Li et al.'s Reciptor [9], we constructed our version with one ISAB followed by one PMA layer.

1) MULTIHEAD ATTENTION
Given a set of n d_q-dimensional query vectors Q ∈ R^{n×d_q} and corresponding key-value pairs K ∈ R^{n×d_k}, V ∈ R^{n×d_v}, the attention function takes Q as input and produces

Att(Q, K, V) = softmax(QK^T / √d_k)V

The output of the above function is a weighted sum of the values V, where each value's weight is determined by the dot product of its corresponding key with the query.
An extended version of this mechanism, called Multihead Attention, was introduced by Vaswani et al., where multiple projections are applied to the query and key-value vectors to produce different attention-based outputs [27]. The k-head attention function has k triplets of linear transformations (W_j^Q, W_j^K, W_j^V for j ∈ {1, . . . , k}), each applied to Q, K and V respectively. The k projections are fed into the attention function to produce k different outputs, which are concatenated and finally projected into an h-dimensional space:

Multihead(Q, K, V) = concat(O_1, . . . , O_k)W^O, where O_j = Att(QW_j^Q, KW_j^K, VW_j^V)

2) MULTIHEAD ATTENTION BLOCK
The Multihead Attention Block (MAB) wraps Multihead Attention with residual connections, a feedforward layer and layer normalization:

MAB(X, Y) = LayerNorm(H + RFF(H)), where H = LayerNorm(X + Multihead(X, Y, Y))

where RFF is a row-wise feedforward layer and LayerNorm is layer normalization [28].
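The basic attention function above can be sketched in a few lines of pure Python; the k-head variant simply applies the same function to k learned projections and concatenates the results. This is a didactic sketch with plain lists, not the authors' implementation (a real model would use a tensor library):

```python
import math

def matmul(A, B):
    """Naive matrix product of two lists-of-lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(xs):
    m = max(xs)                       # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Att(Q, K, V) = softmax(QK^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    KT = [list(col) for col in zip(*K)]
    scores = matmul(Q, KT)                                        # QK^T
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)                                     # weighted sum of V
```

With a query that strongly matches the first key, the output essentially copies the first value row, which is the "weighted sum of V" behavior described in the text.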

3) SET ATTENTION BLOCK
The Set Attention Block was proposed by Lee et al. as an extension of the Multihead Attention Block to calculate self-attention weights between the vectors in a set [10]. The output of the SAB contains element-to-element interactions of the set. Higher-order relations between the elements can be modeled through a stack of SABs. Our approach focuses on learning the combinatorial nature of ingredients, which provides a rationale for using SABs in the model architecture.
Given a set of vectors X ∈ R^{n×d}, the SAB is expressed as

SAB(X) = MAB(X, X)

4) INDUCED SET ATTENTION BLOCK
Lee et al. also proposed the Induced Set Attention Block, which routes attention through a smaller set of learned inducing vectors to reduce the cost of self-attention over large sets [10]. Given a set of input vectors X ∈ R^{n×d} and a set of inducing vectors K ∈ R^{k×d}, the ISAB is expressed as

ISAB(X) = MAB(X, H), where H = MAB(K, X)

5) MULTIHEAD ATTENTION BASED POOLING LAYER
A common permutation-invariant method to aggregate element-wise representations is element-wise summation [29], [30]. Lee et al. instead proposed aggregating the representations by applying multihead attention over a set of m parameterized seed vectors S ∈ R^{m×d} [10]. Given a set of n ingredient vectors refined by the preceding SAB or ISAB, Z ∈ R^{n×d}, pooling by Multihead Attention (PMA) is expressed as

PMA(Z) = MAB(S, RFF(Z))

where RFF is a row-wise feedforward layer.
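The key property of PMA is that, like summation, it is permutation invariant: reordering the set elements does not change the pooled output. A stripped-down sketch with a single seed vector and single head (no RFF, residuals or LayerNorm, which the full MAB would add) makes this easy to check:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def pma(seed, Z):
    """One seed vector attends over the set Z and returns one pooled vector."""
    d = len(seed)
    scores = softmax([sum(s * z for s, z in zip(seed, row)) / math.sqrt(d) for row in Z])
    # Weighted sum of the set elements, one weight per element.
    return [sum(w * row[j] for w, row in zip(scores, Z)) for j in range(d)]

Z = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
seed = [0.5, 0.5]
pooled = pma(seed, Z)
# Permuting Z leaves `pooled` unchanged (up to float round-off):
# this is the permutation invariance that makes PMA a valid set-pooling layer.
```

In RecipeBowl the seeds are trainable parameters, so the model learns *what to look for* when summarizing an ingredient set, rather than weighting every element equally as summation does.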

6) SET TRANSFORMER
In summary, given an input set of ingredient vectors I ∈ R^{n×d}, the Set Transformer employed in our work is expressed as

S = LayerNorm(ReLU(PMA(I')W_s + b_s)), where I' = ISAB(I) (9)

where W_s ∈ R^{d×h} and b_s ∈ R^h are the weights and biases of the final nonlinear transformation in the Set Transformer and S ∈ R^h is the final latent representation of the set of ingredients. We denote this module the Set Encoder in our overall architecture, as it encodes a set of ingredients into a latent embedding space.

C. 2-WAY DECODER - PREDICTING INGREDIENTS AND RECIPES
The 2-way Decoder takes the set context vector concatenated with the 630-dimensional tag vector as input and generates the d-dimensional target ingredient vector and the r-dimensional target recipe vector. The tag vector acts as a constraint guiding the model's predictive space. Given the encoded set representation S ∈ R^h and the binary tag vector T ∈ {0, 1}^630, the predicted vectors for the target ingredient ŷ_p ∈ R^d and recipe ŷ_q ∈ R^r are expressed as

ŷ_p = ReLU([S; T]W_1 + b_1)W_2 + b_2
ŷ_q = ReLU([S; T]W_3 + b_3)W_4 + b_4

where [S; T] denotes concatenation, W_1, W_3 ∈ R^{h×d}, W_2, W_4 ∈ R^{d×d} are trainable weights and b_1, b_2, b_3, b_4 ∈ R^d are trainable biases.
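The two heads above can be sketched as a pair of small two-layer networks over the concatenated input. The dimensions and random weights below are toy placeholders, not the model's actual sizes:

```python
import random

random.seed(0)  # deterministic toy weights

def linear(x, W, b):
    """W has one weight row per output unit: out_j = x . W[j] + b[j]."""
    return [sum(xi * wi for xi, wi in zip(x, row)) + bj for row, bj in zip(W, b)]

def relu(x):
    return [max(0.0, v) for v in x]

def rand_matrix(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

d_set, d_tag, d_ing, d_rec = 4, 3, 4, 5   # toy dimensions (real: h, 630, d, r)
W1, b1 = rand_matrix(d_ing, d_set + d_tag), [0.0] * d_ing
W2, b2 = rand_matrix(d_ing, d_ing), [0.0] * d_ing
W3, b3 = rand_matrix(d_rec, d_set + d_tag), [0.0] * d_rec
W4, b4 = rand_matrix(d_rec, d_rec), [0.0] * d_rec

def two_way_decode(set_ctx, tag_vec):
    x = set_ctx + tag_vec                                 # concatenation [S; T]
    y_p = linear(relu(linear(x, W1, b1)), W2, b2)         # predicted ingredient vector
    y_q = linear(relu(linear(x, W3, b3)), W4, b4)         # predicted recipe vector
    return y_p, y_q
```

The two heads share no weights; only the input [S; T] is shared, so each head can specialize to its own embedding space.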

D. LOSS OBJECTIVE FUNCTION AND OPTIMIZATION
Given a pair of a predicted vector and its ground-truth target (ŷ_p, y_p), we employed a negative log-likelihood loss based on a softmax over negative Euclidean distances in the ingredient embedding space [31], [32]. As we trained our model with batch sampling, the softmax for the Euclidean distance of the ith pair (ŷ_p(i), y_p(i)) is calculated over the batch of target ingredient vectors including y_p(i). Given a batch B and model parameters Θ, the ingredient loss for RecipeBowl is expressed as

L_p(ŷ_p(i), y_p(i), Θ) = -log [ exp(-‖ŷ_p(i) - y_p(i)‖ / τ) / Σ_{j∈B} exp(-‖ŷ_p(i) - y_p(j)‖ / τ) ]

where τ is a temperature scalar for controlling model optimization [33]. The model is therefore trained in a distance metric learning setting, since the Euclidean distance between the predicted and target ingredient vectors is minimized [31]. Given the ith target ingredient as the positive sample, we adopted the idea of using all other target ingredients in the batch as negative samples for better optimization [34]; we denote this scheme as using in-batch negatives. For training the model on recipe prediction given the ith pair (ŷ_q(i), y_q(i)) in the training batch, we employed the cosine embedding loss defined as

L_q(ŷ_q(i), y_q(i), Θ) = 1 - cosine(ŷ_q(i), y_q(i)) (15)

Finally, the multi-objective loss for a batch of quadruplets (ŷ_p, y_p, ŷ_q, y_q) is

L(ŷ_p, y_p, ŷ_q, y_q, Θ) = (1/|B|) Σ_{i∈B} [L_p(i) + L_q(i)]

where L_p(i) and L_q(i) are simplified notations of the loss terms for the ith sample in the batch.
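A minimal interpretation of the two loss terms described above, written with plain Python lists (a training implementation would use autograd tensors; the exact distance variant the authors use may differ):

```python
import math

def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def ingredient_loss(pred, target, batch_targets, tau=1.0):
    """Negative log-likelihood of a softmax over negative Euclidean distances.
    The other targets in the batch act as in-batch negatives."""
    logits = [-euclid(pred, t) / tau for t in batch_targets]
    m = max(logits)                                   # log-sum-exp stabilization
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return -(-euclid(pred, target) / tau - log_z)

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def recipe_loss(pred, target):
    """Cosine embedding loss: 0 when prediction and target are aligned."""
    return 1.0 - cosine(pred, target)
```

As expected of a metric-learning objective, the ingredient loss approaches zero when the prediction sits on its target and far from the in-batch negatives, and grows as the prediction drifts toward a negative.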

A. EXPERIMENTAL SETTING
We conducted experiments to evaluate and compare our proposed RecipeBowl's performance on the recipe completion task with other model options. We first performed a simple preliminary experiment by assigning to every leave-one-out input set the same list of ingredients, ranked by how often each occurs as a target ingredient in the whole dataset; we denote this method as Popularity Choice. We selected traditional machine learning approaches as baselines to evaluate our proposed model architecture. We imported the pretrained FlavorGraph embeddings and summed the embeddings of the input ingredients into a single 300-dimensional continuous vector [14]. We then concatenated it with the corresponding 630-dimensional cooking tag vector, so the dimension of each input vector is 930. The baseline models used in this setting, Random Forest Classifier, Logistic Regression and MLP Classifier, were all imported from the Scikit-learn Python package [35].

1) MODEL TRAINING AND EVALUATION METRICS
We fit the traditional machine learning models on our large training dataset and evaluated their performance based on the predicted probabilities for each class (3,729 ingredients). The predicted lists of probabilities were sorted for evaluation. The deep learning architectures using various Set Encoder modules, including RecipeBowl and its ablated versions, were trained for a maximum of 60 epochs with early stopping using the AdaBound optimizer [37]. All models were trained on the same training dataset and evaluated on the same test dataset. The hyperparameters for RecipeBowl were tuned on the validation dataset and are available in the anonymous code repository.
We retrieved the predicted ingredient vectors for the test dataset from the deep learning models, including RecipeBowl, to generate ranking-based recommendation results. We then calculated a pairwise matrix of cosine similarity scores between the vector predictions for the incomplete ingredient sets in the test dataset and the 3,729 actual ingredient vectors, and sorted the similarity scores to obtain a ranked list of recommended ingredients. These ranked lists are used for evaluation based on multi-item recommendation. We used Mean Reciprocal Rank (MRR) and Recall@K (K = 1, 5, 10) to evaluate the recommendation results derived from the scores.
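Both metrics are simple functions of the rank of the true target within each similarity-sorted candidate list. A sketch, assuming each test instance contributes one ranked list and one true target (the ingredient names are illustrative):

```python
def mrr(ranked_lists, targets):
    """Mean Reciprocal Rank: mean of 1/rank of the true target per instance."""
    total = 0.0
    for ranked, target in zip(ranked_lists, targets):
        rank = ranked.index(target) + 1       # 1-based rank of the true target
        total += 1.0 / rank
    return total / len(targets)

def recall_at_k(ranked_lists, targets, k):
    """Fraction of instances whose true target appears in the top K."""
    hits = sum(1 for ranked, target in zip(ranked_lists, targets)
               if target in ranked[:k])
    return hits / len(targets)

ranked = [["tortillas", "rice", "salt"], ["rice", "salt", "tortillas"]]
targets = ["tortillas", "tortillas"]
print(mrr(ranked, targets))            # (1/1 + 1/3) / 2
print(recall_at_k(ranked, targets, 1))
```

In the paper's setting, each ranked list would contain all 3,729 ingredients sorted by cosine similarity to the model's prediction.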

1) MODEL PERFORMANCE
We made 10 different 80%/10%/10% random splits of our dataset to perform the main experiments on the recipe completion task. In addition, the random initialization of trainable parameters in the deep learning models differs across the random splits. For each model configuration, including the traditional machine learning models, we calculated the mean and standard deviation of each evaluation metric: MRR, Recall@1, Recall@5 and Recall@10. We also conducted statistical tests to obtain p-values demonstrating RecipeBowl's statistical significance. Table 2 summarizes the results.

2) ABLATION STUDY ON GENERAL MODEL ARCHITECTURE
We performed ablation tests to find out whether 1) utilizing recipe context information, 2) employing a negative log-likelihood loss based on a softmax over Euclidean distances with in-batch negatives, 3) using the pretrained FlavorGraph vectors as initial embeddings for RecipeBowl and 4) adding a Decoder before projecting the set context vectors into another embedding space were effective or detrimental to RecipeBowl's training. Table 3 shows the ablation results for RecipeBowl. All ablation experiments were performed on the first random split of our dataset. The ablation results illustrate the importance of selecting the right loss criteria for training RecipeBowl. Combining the effects of distance metric learning and in-batch negatives, which randomly contain both easy and hard (highly related to targets) ingredient negatives, seemingly benefits RecipeBowl's performance.
In terms of model architecture, results show RecipeBowl's dependency on both the 2-way Decoder (MRR: 0.1343) and the tag vectors (MRR: 0.1463). Considering the risks of multitask learning, our ablation results show that the recipe prediction task does not negatively affect RecipeBowl but rather boosts performance by a small amount (MRR: 0.2153). Although we imported the pretrained FlavorGraph embeddings from Park et al.'s work, our ablation results show little difference in performance (MRR: 0.2153), leaving room for further investigation.

A. RecipeBowl RECOMMENDATIONS
RecipeBowl accepts any ingredient set and recommends additional ingredients and candidate recipes, as illustrated in Figure 3. In Table 4, although RecipeBowl did not predict the correct target ingredients (bold-faced, tortillas, cooked white rice), there were still meaningful suggestions. For the Mexican dish, our model recommended tortilla chips at top 1, while tortillas ranked third. For the rice dish, while our model did not predict perfectly (cooked white rice was out of the top 10), most of the recommendations still aligned with the target ingredient (e.g., wild rice, yellow rice). We expect RecipeBowl's flexibility and understanding of cooking to be helpful in making cooking choices. Figure 4 shows the distribution of both target and predicted embedding vectors. While the predicted ingredients are close to their corresponding targets, the embeddings appear to cluster into eight categories overall. This shows that RecipeBowl learned not only the optimal ingredient for a given set but also recipe categorical features. Figure 5 shows the distribution of sixteen target embeddings and their corresponding predictions, as illustrated in the Embedding Space of Figure 3. In this analysis, 16 target ingredients were randomly selected according to their ingredient categories, along with their predictions in the test dataset. Most of the predicted ingredients tended to form clusters around the selected targets. Moreover, some target ingredients are centered within their prediction clusters (e.g., mashed bananas, bread, chicken breasts). Interestingly, clusters that belong to the same ingredient category (e.g., pork chops, chicken wings, chicken breasts) tend to be relatively close to each other. We also found the target pairs bread flour & yeast and cocoa & chocolate close to each other, along with their prediction clusters.
Bread flour and yeast are known to be used together in most recipes, while cocoa is one of the materials for making chocolate chips. These observations show that RecipeBowl learned ingredient relationships during training. We also examined the set context vectors, the direct outputs of the Set Encoder, prior to being propagated to the 2-way Decoder. We selected blueberries, apples, buttermilk and chocolate chips from the list used in the t-SNE visualization and extracted incomplete ingredient lists of equal size (150) containing each of them from the test dataset. We then used the Set Encoder of RecipeBowl to generate 4 groups of 150 set context vectors and visualized a clustermap for each group. We selected blueberries and apples since both are fruit ingredients used in a wide variety of dishes. In contrast, we additionally selected buttermilk and chocolate chips, which may be used in limited recipe categories such as bakery and desserts. The clustermaps shown in Figure 6 exhibit distinctive clusters that bring interesting insights. For example, apples can be used in a wide range of recipes, such as sweet desserts (Caramel Apple), bakery foods (Apple Maple Muffin) or as sauces in meat-based dishes (Apple Pork Chops) [38], [39]. Buttermilk is widely used in bakery products due to its nutritional value and taste-enhancement features [40]. We observe that among the sampled 150 set context vectors including buttermilk, most were used in bakery recipes (Basic Chocolate Cake). Overall, RecipeBowl can distinguish different types of recipe context according to the uses of a particular ingredient. The detailed clustermaps for these ingredients can be found in the code repository. We also extracted the attention weights from the Set Encoder (illustrated in Figure 3) and normalized them with min-max scaling. We studied the recommendation examples and observed which ingredients had high influence on building the set context vector.
For Spicy Tuna Salad Roll, nori received the highest attention, which helped RecipeBowl understand that the input set is mainly Japanese cuisine. For BBQ Chicken Pizza, chicken breasts, fresh cilantro and red onions interestingly received notably higher attention than mozzarella cheese. Lastly, the input set for Casserole Quiche contained ingredients mainly used in Mexican cuisine, such as bell peppers and chili peppers [41], [42]. In turn, we speculate that RecipeBowl was able to predict tortilla chips based on the high attention values of the above ingredients, as tortilla-related ingredients are also commonly used in Mexican dishes.

VII. CONCLUSION AND FUTURE WORK
We introduced RecipeBowl, a set-based cooking recommender for candidate ingredients and recipes. To train the model, we formulated a supervised recipe completion setting using an extended dataset from Reciptor [9] and employed the Set Transformer [10] framework to encode ingredients into a set context representation. On the formulated recipe completion task, our model showed the best results among set-encoding variants and traditional machine learning baselines. Recommendation results demonstrate RecipeBowl's ability to generate both plausible and diverse recommendations for a given set of ingredients. We performed an in-depth analysis of RecipeBowl (illustrated in Figure 3) in a bottom-to-top fashion, starting from the predicted Embedding Space, where the vector embeddings formed meaningful clusters. We also investigated visualizations of the set context vectors, the direct outputs of the Set Encoder, and examined the attention weights extracted from the Set Encoder itself, finding them supportive of our model's performance. In sum, our formulated recipe completion task and set representation approach proved beneficial for suggesting ingredients and recipes.
As future work, while RecipeBowl was able to suggest both appropriate ingredients and recipe candidates for a given set of other ingredients, some recipe candidates seemed inconsistent with the suggested ingredients. We plan to improve RecipeBowl by encouraging it to recommend recipe candidates related to its suggested ingredients. Although our customized Set Transformer allowed RecipeBowl to be trained successfully on recipe completion, we plan to improve the Set Encoder to extract richer cooking knowledge and provide better interpretability. Since the recipe completion task removes only one ingredient from each input set as the prediction target, we acknowledge that the model may have limitations in generating recommendations given only a few ingredients. We plan to address this issue in future work by formulating a more suitable task setting. In addition, we plan to incorporate nutritional features and consider dietary requirements during recommendation. Lastly, we plan to release an applicable version of RecipeBowl.