Service-aware Personalized Item Recommendation

Current recommender systems employ item-centric properties to estimate ratings and present the results to the user. However, recent studies highlight the fact that the stages of item fruition also involve extrinsic factors, such as the interaction with the service provider before, during and after item selection. In other words, a holistic view on consumer experience, including local properties of items, as well as consumers’ perceptions of item fruition, should be adopted to enhance user awareness and decision-making. In this work, we integrate recommender systems with service models to reason about the different stages of item fruition. By exploiting the Service Journey Maps to define service-based item and user profiles, we develop a novel family of recommender systems that evaluate items by taking preference management and overall consumer experience into account. Moreover, we introduce a two-level visual model to provide users with different information about recommendation results: (i) the higher level summarizes consumer experience about items and supports the identification of promising suggestions within a possibly long list of results; (ii) the lower level enables the exploration of detailed data about the local properties of items. In a user test instantiated in the home-booking domain, we compared our models to standard recommender systems. We found that the service-based algorithms that only use item fruition experience excel in ranking and in the minimization of the error in rating estimation. Moreover, the combination of data about item fruition experience and item properties achieves slightly lower recommendation performance; however, it enhances users’ perceptions of the awareness and of the decision-making support provided by the system. These results encourage the adoption of service-based models to summarize user preferences and experience in recommender systems.


I. INTRODUCTION
In service modeling research, Stickdorn et al. [1] point out that items are complex entities whose fruition might involve stages of interaction with multiple services and actors that jointly impact customer experience. For instance, in the services related to the circular economy, such as home-booking, the offered value goes beyond item features and includes the interaction with apartments' hosts, implying different attitudes toward renting rooms or complete homes [2]. Moreover, in online retailing, the satisfaction with products depends both on their properties and on the experience with post-sales services related to customer care.
Starting from these considerations, we point out that, when personalizing the recommendation of items, their local features and the expected experience with them should be jointly analyzed in the identification of the most relevant options, as well as in their presentation to the user.
Content-based, feature-based and collaborative recommender systems [3] base their suggestions on local item properties, such as the features and aspects extracted from catalogs, and on the overall ratings received by items, which represent the only utility factors steering the recommendations. Review-based recommender systems study consumer feedback to extract data about people's experience [4], [5]. However, as they do not contextualize reviews in the stages of item fruition to which consumers are exposed, these algorithms cannot aggregate information in an effective way. To overcome these limitations, we propose to enable recommender systems to reason about consumer experience in the stages of item fruition by integrating them with service modeling techniques. As we aim at enhancing recommendation performance and user-awareness support to improve decision-making, we pose the following research questions:
• RQ1: Does the extension of recommendation algorithms with a service-based representation of items (which explicitly models the item fruition stages) enhance recommendation quality in Top-N recommender systems, compared to only considering local item properties and overall ratings?
• RQ2: Does the presentation of both item properties and service-based information about their fruition enhance users' awareness about the suggested options, and their confidence in selection decisions, compared to only presenting item properties?
To answer these questions we propose a family of service-aware recommender systems that evaluate items based on individual user preferences and on evaluation dimensions associated with item fruition stages. These dimensions abstract from the individual details that emerge from item reviews. Thus, they can be used to provide the user with a holistic summary of the experience collected by previous consumers.
To specify the item fruition stages, we employ the Service Journey Maps design model [6]. We selected the home-booking domain as a test-bed for our work because it involves the user in a rich experience regarding both the home and the interaction with its host. However, our model can be applied to the suggestion of items in other domains, such as hotel booking and e-commerce in the sharing economy. In fact, in those scenarios, users can be exposed to the interaction with amateur service providers and retailers, possibly offering low service-quality levels. Therefore, item fruition can be impacted by exogenous risk factors [7], and a summarization of customer experience can enhance the acceptance of recommendation results [8].
We compared our service-aware recommender systems to standard algorithms in a user study involving 48 participants. We tested five recommendation models and three visualization models by retrieving data about homes and reviews from the Airbnb location-based service (https://airbnb.com). The results of this study reveal that the service-based algorithms exclusively based on item fruition experience achieve the best results in ranking and in the minimization of the rating estimation error. Moreover, the algorithms that combine item properties with fruition experience achieve slightly lower recommendation performance. However, they enhance users' perceived awareness support and confidence in item selection. In summary, we provide two novel contributions:
1) We define different service-aware recommendation algorithms based on item features, on evaluation dimensions associated with item fruition stages, or on both.
2) We compare the performance of these algorithms to standard recommender systems in terms of utility, rating estimation and ranking capability. Moreover, we compare users' perceived quality of suggestions, their awareness about the proposed items, and their perception of the interface adequacy during the interaction with these systems.
This work is framed within the Apartment Monitoring application that helps users in finding homes from Airbnb. We extend the work described in [9] with the introduction of service-aware recommendation and with a presentation model that supports the overview of recommendation lists.
In the following, we present the related work (Section II); then we describe our dataset and data processing method (Section III). Section IV introduces the recommendation models we define. Sections V and VI describe the user study we carried out and its results, which we discuss in Section VII. Section VIII outlines limitations and future work. Sections IX and X summarize the ethical issues of our work and conclude the paper.

II. RELATED WORK
A. SERVICE JOURNEY MAPS
The Service Journey Maps (SJMs) [6] support the design and development of products and services by focusing on the customer's viewpoint. A SJM is a visual description of user experience with a service, such as a hotel, or an online retailer, which models the stages that customers encounter during service fruition. The graphic visualization of a SJM follows a temporal line from the start point (e.g., enter website) to the end one (e.g., customer care) to describe the stages a person engages in when using the service.
Different from standard recommender systems, we employ the Service Journey Maps to describe the process underlying the fruition of the suggested items. Specifically, we use the domain model built using SJMs to steer the analysis of item reviews by clustering feedback around the specified service stages. As a result, we define a small set of evaluation dimensions that a service-aware recommender system can exploit (possibly fusing them with information about item properties) (i) to estimate item ratings and (ii) to generate a visual overview of recommendation lists, based on a holistic summary of previous consumers' experience with items.

B. RECOMMENDATION ALGORITHMS
Most recommender systems generate personalized suggestions using item-centric data that does not reflect consumer experience. Collaborative filtering evaluates items based on the ratings provided by users [10], [11]. Multi-criteria recommender systems introduce multi-dimensional ratings [14], [15]. Moreover, to ground systems' inferences on richer types of information, content-based filtering combines pure ratings with item features extracted from catalogs [16]. Some graph-based recommenders personalize the suggestions based on the chains of relations that connect users to items [18], [19], possibly by exploiting the Linked Open Data cloud [20]. Finally, hybrid recommender systems integrate different algorithms to improve their suggestions [21]-[25].
Review-based recommender systems [4] extract item features and aspects from online reviews to build user and item models [26]-[30]. Some systems estimate item ratings from their reviews [31], [32], or analyze reviews to evaluate their helpfulness to item evaluation [39]. However, as these systems ignore service modeling, they cannot aggregate the data they extract, nor recognize user preferences, with respect to the stages of item fruition.
Differently, we enrich item recommendation with a holistic evaluation of consumer experience during the stages of item fruition. As the Service Journey Maps support the identification of a small number of evaluation dimensions describing such experience, they enable us to replace the detailed item aspects mentioned in the reviews with a few factors to be evaluated in rating estimation. Compared to the research about review-based recommender systems, we analyze item aspects but we synthesize the data they bring directly into the dimensions of experience. Thus, we separate the interpretation of the sentiment emerging from consumer feedback from item evaluation.

C. PRESENTATION OF RECOMMENDATION RESULTS
Different presentation styles are applied to describe results, depending on the recommendation algorithm. In collaborative filtering, users seem to appreciate the bar graphs of neighbors' ratings [13]. Content-based recommender systems typically present suggestions by highlighting the degree of match between item features and user preferences, as in [17], [40]. Moreover, in the research about exploratory search and hybrid recommender systems, several works focus on empowering the user to tune the impact of different relevance perspectives on item recommendation [21]-[24], [41].
Product comparison is a crucial decision stage that buyers usually perform before they make a choice [33]. Some aspect-based recommender systems indirectly support this activity by presenting the features of items which match, or mismatch, the target user's preferences [34]-[36], fusing recommendation and explanation of results to enhance transparency [12], [42]. Other works group items by their properties to facilitate their comparison [33], [37]. Moreover, to support the transparency of Matrix Factorization, McAuley and Leskovec match the item features extracted from reviews to latent factors used in the presentation of results [38].
As all these systems do not model the services behind item fruition, they cannot synthesize the information about consumer experience in this respect. Therefore, they present fairly long lists of aspects that they organize using metadata [37], or which they shorten by removing the less relevant aspects [21].
Differently, our work supports the organization and interpretation of aspects and features with respect to a small number of evaluation dimensions measuring consumer experience, the same for each suggested item, regardless of how many aspects characterize it. This is the basis for the generation of visual overviews of recommendation lists that limit information overload by enabling the user to selectively inspect the details of the relevant items, regarding the evaluation dimensions (s)he cares about.

III. DATA
Our experiments are based on the home-booking domain, using data provided by Airbnb. That platform supports searching for homes for both leisure and work stays. Similar to other services, such as Booking.com (https://www.booking.com), Airbnb allows customers to write at most one review for each home after the end of the renting contract. This approach enhances the reliability of consumer feedback because it guarantees that comments and evaluations are provided by people who experienced the service.

A. DATASET
For our experiments, we used a public dataset of Airbnb reviews concerning the city of London.¹ The dataset contains information about homes (denoted as "listings"), their hosts, and the offered amenities, i.e., item features such as Wi-Fi and washing machine. The dataset also stores the reviews about homes uploaded by their renters ("guests"), but it does not report the associated ratings. From this dataset, we selected the reviews written in English and we removed the listings that did not receive any comments during the last three years. The filtered dataset contains 764,958 guests, 43,604 listings, and 906,967 reviews. Table 2 provides some descriptive statistics of the filtered dataset. It can be noticed that several reviews of this dataset are very long and mention a wide spectrum of aspects of homes, hosts, and the surrounding environment. For example: "The flat was bright, comfortable and clean and Adriano was pleasant and gracious about accommodating us at the last minute. The Brixton tube was a very short walk away and there were plenty of buses. There are lots of fast food restaurants, banks, and shops along the main street."

B. EVALUATION DIMENSIONS
The service-aware recommender systems we propose build on Mauro et al.'s work [9], which we outline to keep the paper self-contained. Mauro et al. defined a Service Journey Map (SJM) for home-booking by taking inspiration from existing maps developed for hotel booking [43], and from previous analyses about home-booking services [44]. The SJM, shown in the upper portion of Figure 1, focuses on the guest's renting experience, from the search for homes on the Airbnb website to the check-out phase. As it is aimed at describing consumer experience when entering the homes, it overlooks the interaction between the user and backstage services for reservation and payment, and it only models the guest and host roles.
The SJM includes four service stages corresponding to the main activities the guest engages in: Visit website, Check-in, Stay in apartment, Check-out. In the present work, we overlook Visit website because we are not interested in evaluating the user experience with the Airbnb platform.
In [9], the authors derived from the SJM five evaluation dimensions summarizing guests' renting experience. Moreover, they mapped the stages of the map to these evaluation dimensions; see the lower portion of Figure 1.

FIGURE 1. This figure is taken from [9]. The upper portion shows the stages of the Service Journey Map describing the home-booking process (Visit website, Check-in, Stay in apartment, Check-out). Each stage is connected to the associated experience evaluation dimensions (Check-in/Check-out, In-apartment experience, Surroundings, Host appreciation).

In the present work we consider four dimensions:
1) Host appreciation represents guests' perceptions of the host and of the interaction with her/him at any time of service fruition.
2) Check-in/Check-out summarizes guests' experience at check-in and check-out times. It concerns aspects such as timeliness.
3) In-apartment experience represents guests' perceptions within the apartment. It covers aspects such as its cleanliness and comfort.
4) Surroundings describes the perception of the area where the home is located, in terms of aspects such as available services and quietness.

C. ANALYSIS OF REVIEWS ABOUT HOMES
We organize the opinions emerging from the reviews around the previously described evaluation dimensions. For each home h, we analyze its reviews in three steps that we present in the following subsections.

1) Extraction of aspects from the reviews of h and computation of sentiment
We extract the aspects and corresponding adjectives from the reviews by applying an extension of the Double Propagation algorithm [45], after having analyzed sentences through dependency parsing. We then count the number of occurrences (frequency) of each ⟨aspect, adjective⟩ pair to measure how frequently people express the corresponding opinion. Moreover, we compute the polarity of the aspect as the mean value returned by the TextBlob [46] and Vader [47] opinion mining libraries, and we normalize this value to obtain an evaluation in [0, 1]. The output of this step is a list of ⟨aspect, adjective, evaluation, frequency⟩ tuples, one for each aspect-adjective pair that appears in the reviews of h. See the first four columns of Table 3, which concerns a sample home of our dataset.
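As an illustration, the averaging and normalization step can be sketched in Python. The function and variable names are ours, and the two input scores stand in for the polarity values that TextBlob and Vader would return (both in [-1, 1]); the sketch does not reproduce the aspect-extraction step itself.

```python
from collections import Counter

def normalized_evaluation(textblob_polarity: float, vader_compound: float) -> float:
    """Average two sentiment scores in [-1, 1] and rescale the mean to [0, 1]."""
    mean_polarity = (textblob_polarity + vader_compound) / 2
    return (mean_polarity + 1) / 2

def aspect_tuples(pairs_with_scores):
    """pairs_with_scores: iterable of (aspect, adjective, textblob, vader) rows,
    one per occurrence in the reviews of a home.
    Returns <aspect, adjective, evaluation, frequency> tuples, where evaluation
    is the mean normalized polarity over the pair's occurrences."""
    freq = Counter((a, adj) for a, adj, _, _ in pairs_with_scores)
    scores = {}
    for a, adj, tb, vd in pairs_with_scores:
        scores.setdefault((a, adj), []).append(normalized_evaluation(tb, vd))
    return [(a, adj, sum(s) / len(s), freq[(a, adj)])
            for (a, adj), s in scores.items()]
```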

2) Classification of aspects in evaluation dimensions
Similar to [9], we group the aspects extracted from the reviews of h by experience evaluation dimension; see the fifth column of Table 3. For this task we use four dictionaries that specify the terms typically used by people to refer to such dimensions. For instance, the In-apartment experience dictionary includes words like "kitchen", "bed" and "bathroom".
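A minimal sketch of the dictionary-based classification follows; the dictionary contents below are illustrative examples drawn from the review excerpt above, not the paper's actual term lists.

```python
# Illustrative term dictionaries (not the paper's actual resources):
# each evaluation dimension is described by a set of typical aspect terms.
DIMENSION_DICTIONARIES = {
    "Host appreciation": {"host", "owner", "welcome"},
    "Check-in/Check-out": {"check-in", "check-out", "keys"},
    "In-apartment experience": {"kitchen", "bed", "bathroom", "flat"},
    "Surroundings": {"street", "tube", "restaurants", "area"},
}

def classify_aspect(aspect: str):
    """Return the dimension whose dictionary contains the aspect term,
    or None if the aspect belongs to no dimension."""
    aspect = aspect.lower()
    for dimension, terms in DIMENSION_DICTIONARIES.items():
        if aspect in terms:
            return dimension
    return None
```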

3) Computation of the values of the experience evaluation dimensions of h
Given a home h, let us consider an evaluation dimension d (e.g., Host appreciation) and the set AA_dh of ⟨aspect, adjective⟩ pairs extracted from the reviews of h that are classified in d. We compute the value of d in h (value_dh) as the weighted mean of the evaluations of the pairs p ∈ AA_dh. For each pair, we use as weight its frequency in the reviews of h to tune its influence based on how many people share the same opinion:

value_dh = ( Σ_{p ∈ AA_dh} frequency_p · evaluation_p ) / ( Σ_{p ∈ AA_dh} frequency_p )

where frequency_p is the frequency of pair p in the reviews of h, and evaluation_p is the evaluation of p derived from the polarity of the aspect included in p. For instance, referring to Table 3, for the Host appreciation dimension we compute the weighted mean of the evaluation and frequency values of ⟨host, wonderful⟩ and ⟨host, friendly⟩.
In a preliminary user study, we found that people perceive the lack of information about a home as a negative evaluation factor [48]. Thus, if the reviews of h do not mention any aspects related to a dimension d, or the home has no associated reviews, we set d to 0.1.
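The frequency-weighted mean, together with the 0.1 fallback for unmentioned dimensions, can be sketched as follows (the function name is ours):

```python
def dimension_value(pairs, default=0.1):
    """pairs: list of (evaluation, frequency) tuples for the <aspect, adjective>
    pairs of home h classified in dimension d.
    Returns the frequency-weighted mean of the evaluations, or `default`
    (0.1, per the preliminary study on missing information) when no aspect
    of d is mentioned in the reviews of h."""
    total_frequency = sum(f for _, f in pairs)
    if total_frequency == 0:
        return default
    return sum(e * f for e, f in pairs) / total_frequency
```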

IV. RECOMMENDATION MODELS
This section describes the service-aware recommendation models we define and the baselines we use to evaluate them. For each model we present both the algorithm underlying it and the user interface for its evaluation with users. We first present the baselines, which some of our service-aware recommenders integrate into a hybrid system. We adopt the following notation:
• I is the set of items (homes) and U is the set of users (guests).
• For each i ∈ I and u ∈ U, the i and u vectors represent the item and user profile, respectively.
• Given i and u, we denote the rating of i estimated by the system as r̂_ui.

A. FEATURES
1) Model
• Item profile: i = <f_1, . . . , f_z>, where f_j = 1 if i offers the corresponding amenity, 0 otherwise.
• User profile: u = <p_1, . . . , p_z> stores the user's preferences for the item features. For j ∈ {1, . . . , z}, u_j has value 1 ("It's very important"), 0 ("I don't like it"), or 0.5 ("I don't care", default value).
This algorithm focuses on the features that u likes or dislikes. It estimates r̂_ui by normalizing in [1, 5] the cosine similarity between the projections of the u and i vectors (denoted as u⃗ and i⃗) on the components whose value is 0 or 1:

r̂_ui = 1 + 4 ∗ (u⃗ · i⃗) / (∥u⃗∥_F ∗ ∥i⃗∥_F)

where · is the scalar vector product, ∥ · ∥_F is the Frobenius norm and ∗ is the decimal product. If u⃗ is empty, r̂_ui is computed by applying a standard popularity-based recommendation algorithm (POP) that suggests the items which received the highest number of reviews.
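The rating step can be sketched as follows under our reading of the model: project u and i onto the components where the user expressed a preference (value 0 or 1), take the cosine similarity, and map it linearly onto [1, 5]. The handling of zero-norm projections and the `default` return value (standing in for the POP fallback) are our assumptions.

```python
import math

def features_rating(u, i, default=None):
    """u: preference vector with values in {0, 0.5, 1}; i: binary amenity vector.
    Projects both vectors onto the components where the user expressed a
    preference (value 0 or 1), then maps the cosine similarity linearly
    onto [1, 5]. Returns `default` when the projected user vector is empty
    (the paper falls back to the popularity-based POP recommender there)."""
    idx = [j for j, p in enumerate(u) if p in (0, 1)]
    if not idx:
        return default
    pu = [u[j] for j in idx]
    pi = [i[j] for j in idx]
    nu = math.sqrt(sum(x * x for x in pu))
    ni = math.sqrt(sum(x * x for x in pi))
    # Zero-norm projections are treated as similarity 0 (our assumption).
    cos = 0.0 if nu == 0 or ni == 0 else sum(a * b for a, b in zip(pu, pi)) / (nu * ni)
    return 1 + 4 * cos
```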

2) User interface
Acquisition of user preferences (Figure 2). The system shows the amenities offered by the visualized home and enables the user u to declare the importance of the corresponding preferences. The right sidebar enables u to select the amenities that the home lacks but other homes offer; for each selected amenity, the system sets u's preference to "It's very important". However, if u marks the same amenity both as preferred and as disliked when viewing different homes, the system resolves this ambiguity by resetting the corresponding preference in u to "I don't care".
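The conflict-resolution rule can be sketched as follows; the function and constant names are ours, mirroring the 1 / 0 / 0.5 preference values used by FEATURES.

```python
# Preference encoding used by FEATURES: 1 = "It's very important",
# 0 = "I don't like it", 0.5 = "I don't care" (default).
VERY_IMPORTANT, DISLIKE, DONT_CARE = 1.0, 0.0, 0.5

def update_preference(profile: dict, amenity: str, new_value: float) -> dict:
    """Record the user's preference for an amenity. If the user marked the
    same amenity both as preferred and as disliked on different homes, the
    ambiguity is resolved by resetting the preference to "I don't care"."""
    old_value = profile.get(amenity, DONT_CARE)
    if {old_value, new_value} == {DISLIKE, VERY_IMPORTANT}:
        profile[amenity] = DONT_CARE
    else:
        profile[amenity] = new_value
    return profile
```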
The rating elicitation component at the bottom of the page is not relevant to FEATURES but the interface includes it because it is used in CBF; see Section IV-B. This widget shows a list of smilies mapped to the [1,5] scale, and the "I don't know" button enabling users to opt-out if they are not able to evaluate a home. We omit details that could influence the item evaluation, such as name, price, number of accepted guests, and picture [42].
Presentation of suggestions. Figure 3 shows the user interface supporting the visualization of the recommendation list and the evaluation of items. The amenities (features) that the user has marked either as liked (e.g., Air conditioning), or disliked (none), are in boldface.

B. CBF
1) Model
This is a content-based recommendation algorithm [16]:
• User profile: u = <p_1, . . . , p_z>, where p_j = 1 if u has positively rated (in [4, 5]) at least one item that offers the corresponding feature, 0 otherwise.
If u has positively rated at least one item, CBF evaluates i by computing the cosine similarity between u and i, normalized in the [1,5] interval. Otherwise, it uses POP.

2) User interface
For the acquisition of the user's preferences this system exploits the user interface shown in Figure 2. The presentation of the recommendation list is similar to that of FEATURES.

C. STAGES (SERVICE-AWARE)
1) Model
• Item profile: i = <value_1, . . . , value_m> stores the values of the evaluation dimensions d_1, . . . , d_m of i, computed from its reviews as described in Section III-C; see Table 3.
• User profile: u = <importance_1, . . . , importance_m> stores the estimated importance of d_1, . . . , d_m to u, i.e., how strongly each of them impacts item selection. For j ∈ {1, . . . , m}, we infer importance_j by normalizing in [0, 1] the Pearson correlation between the overall item ratings provided by u and the values of the evaluation dimension d_j in the respective items. In the computation of the correlation, we ignore the "I don't know" ratings because they are not informative. Intuitively, if u evaluates positively the items having high values in d_j, and negatively the items having low values in the same dimension, we hypothesize that d_j is important to her or him. Conversely, if u's ratings are inconsistent with respect to the values of d_j, it is likely that the interest in d_j is low.
Given u and i, we compute the rating of i as follows:

r̂_ui = 1 + 4 ∗ Π_{j=1}^{m} (imp_ju ∗ value_ji + 1 − imp_ju)

where imp_ju is the importance of dimension d_j to u and value_ji is the evaluation of dimension d_j in i. The (imp ∗ value + 1 − imp) expression tunes the terms of the product (i) by smoothing the impact of low values if they refer to dimensions that u does not care about, and (ii) by maintaining the value of important dimensions thanks to the "1 − imp" addendum.
If u has not evaluated any items in the user preferences acquisition phase, STAGES estimates ratings by using POP.
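The importance inference and the rating product described above can be sketched as follows; the function names are ours, and mapping the final product onto [1, 5] is our assumption, kept consistent with the other models.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def importance(ratings, dim_values):
    """Normalize into [0, 1] the correlation between the user's overall
    ratings and the items' values on one evaluation dimension ("I don't
    know" ratings are assumed to have been filtered out already)."""
    return (pearson(ratings, dim_values) + 1) / 2

def stages_rating(imp, values):
    """imp[j]: importance of dimension d_j to the user, in [0, 1];
    values[j]: value of d_j for the item, in [0, 1].
    Each factor (imp * value + 1 - imp) pushes unimportant dimensions
    toward 1 so they do not penalize the product; mapping the product
    onto [1, 5] is our assumption."""
    prod = 1.0
    for m, v in zip(imp, values):
        prod *= m * v + 1 - m
    return 1 + 4 * prod
```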

2) User interface
Acquisition of user preferences. Figure 4 shows the user interface to elicit the importance of evaluation dimensions from the user. For each home h, the system shows:
• A bar graph that summarizes the consumer experience with h extracted from its reviews (one colored bar for each evaluation dimension). Even though these values are in [0, 1], the bars are displayed in [1, 5] for coherency with the five-point scale used to rate homes.
• The rating elicitation component to evaluate the home.
• The reviews of h. To support information filtering, the system enables the user to select one or more evaluation dimensions by clicking on the respective bars, or on the list of dimensions located above the reviews. In both cases, the system shows the comments including at least one aspect that refers to the selected dimension(s). We use color-coding to highlight the corresponding terms in the comments. For instance, the figure shows a selection of reviews related to In-apartment experience.
As explained in Section III-C, we use dictionaries to group aspects by dimension.
Presentation of suggestions. This user interface is very similar to Figure 4 but specifies that it shows the personalized suggestions proposed by the system. The bar graph provides the user with a summary of consumer experience with items. Moreover, the user can retrieve detailed comments by inspecting the reviews in a selective way. The amenities offered by the home are hidden.

D. FEATURES-STAGES (SERVICE-AWARE)
1) Model
This algorithm combines the information about item features with the service-based perspective on consumers' experience to offer the user a complete view of items. It integrates feature-based and service-based recommendation by computing item ratings as the arithmetic mean of the ratings estimated by FEATURES and STAGES.

2) User interface
We omit the user interface for the acquisition of user preferences because we tested this model on the user profiles built using the user interfaces of FEATURES and STAGES, which provide the preference data to feed it.
In the presentation of suggestions we combine the user interfaces of FEATURES and STAGES by using tabs to support the exploration of both types of information. See the two homes visualized in Figure 5.

E. CBF-STAGES (SERVICE-AWARE)
1) Model
This algorithm integrates content-based filtering with service-based recommendation. It computes item ratings as the arithmetic mean of the ratings estimated by CBF and STAGES.
2) User interface
CBF-STAGES uses the same user interface as FEATURES-STAGES to elicit user preferences and present the recommendations to the user.

V. STUDY DESIGN
We aim at testing the recommendation performance and the level of decision-making support provided by the five models described in Section IV.

A. CONTEXT
We carried out the user study by exploiting an interactive test application that we developed to guide participants through the phases of the experiment without our intervention. Section IV has described a portion of the user interface of that system; see Figures 2 to 5.
People joined the study on a voluntary basis, without any compensation, and they gave their informed consent to participate in it. We recruited adult (≥ 18 years old) participants using social networks and mailing lists. In the message presenting the experiment, we specified that we were looking for people who had previously used a home or hotel booking system.

TABLE 4. Post-task questionnaire statements.
Construct | Factor | Statement
Perceived Quality of Recommendations (Q) | Q1 | The items recommended to me matched my interests.
 | Q2 | This system gave me good suggestions.
 | Q3 | The items recommended to me are similar to each other.
Perceived User-Awareness Support (U) | U1 | This system explains why the products are recommended to me.
 | U2 | I understood why the items were recommended to me.
 | U3 | This recommender system made me more confident about my decision.
Interface Adequacy (I) | I1 | The labels of this recommender system interface are clear.
 | I2 | Finding an item to book with the help of this recommender system is easy.
 | I3 | The information provided for the recommended items is sufficient for me to make a booking decision.

B. METHOD
We applied the within-subjects design to the user study.
We considered each treatment condition as an independent variable and every participant received all the treatments. In the test application, we counterbalanced the order of tasks to minimize result biases and the effects of practice and fatigue. The experiment took on average 36.79 minutes (SD = 19.83). To accommodate diverse user backgrounds and levels of confidence with technology, we did not impose any time limit to complete the study, which was organized in three phases:
1) The test application asked users to declare whether they were ≥ 18 years old; moreover, it asked them to express their consent to participate in the study (the text of the consent is available at https://bit.ly/3jjYlEa). People could continue the test only if they positively answered the first question and they accepted the consent. Then, the application asked participants to fill in a questionnaire that collects basic demographic information, cultural background, familiarity with booking platforms, and whether they tend to trust a person or thing even though they have little knowledge about it. The questionnaire is an adaptation of the ResQue one for recommender systems [50].
2) The application acquired participants' preferences and built their user profiles. For this purpose, it asked people twice, in different moments of the experiment, to rate ten homes; see Figures 2 and 4. The application also asked them to rate the homes presented in one suggestion list for each tested algorithm (Figures 3 and 5). Each list contained five homes to be evaluated according to their suitability as candidates for rent, using the star-based rating elicitation component. After the evaluation of each recommendation list, the test application proposed a post-task questionnaire in which users declared their degree of agreement with the statements reported in Table 4. The questionnaire is a subset of ResQue.
In the table, statements are grouped in three constructs: Perceived Quality of Recommendations (Q), Perceived User-Awareness Support (U), and Interface Adequacy (I).
3) Home-booking is a high-investment domain: similar to [51], the definition of "investment" rests on the concept of price. Thus, we hypothesized that people need detailed information and feedback about items to make a renting decision. To check this hypothesis, before closing the experiment, our application asked participants to answer the post-test questionnaire of Table 5. This questionnaire is aimed at understanding to what extent they considered the visualization of amenities and the summarization of consumer experience important in the evaluation of the system's suggestions.

VI. RESULTS
A. DEMOGRAPHIC DATA AND BACKGROUND
We conducted a power analysis to determine the minimum number of participants needed to obtain statistically significant results. A power analysis involves four parameters: Alpha (α = 0.05), the probability threshold for rejecting the null hypothesis when there is no significant effect (Type I error rate); Power (0.80), the probability of rejecting the null hypothesis when the alternative hypothesis is true (i.e., 1 minus the Type II error rate); Effect size (0.40), the expected magnitude of the effect in the population (our goal was to find medium-sized effects); and the Sample size, which is the output of the calculation.
Regarding familiarity with online services, 15 people declared that they use e-commerce platforms or online booking services a few times a month, 8 use them 1-3 times a week, 11 daily, and 14 a few times a year. Finally, 4 participants declared that they very probably would trust a person or thing even though they had little knowledge about it, 15 probably would trust it, 23 probably would not trust it, and 6 very probably would not trust it.

B. RECOMMENDATION QUALITY
We evaluated the recommendation performance of the algorithms by focusing on ranking, because the placement of good solutions at the top of a recommendation list is important to support their identification. Moreover, we considered the minimization of rating estimation errors as an accuracy measure. We computed the following metrics:
• NDCG (Normalized Discounted Cumulative Gain) measures ranking quality. The gain of items is accumulated from the top of the result list to the bottom, and it is discounted logarithmically at lower ranks.
• RMSE (Root-Mean-Square Error) and MAE (Mean Absolute Error) measure the error between the ratings predicted by the algorithm and the real ratings given by the participants of the user study.
• Utility is an accuracy metric that computes a score for the whole list (rather than for individual items) based on user ratings. The worth of the suggested items declines in the lower positions of the list. For a list of five suggestions:

Utility_u = Σ_{j=1}^{5} max(r_uij − n, 0) / 2^{(j−1)/(α−1)}

where r_uij is the rating given by a user u to the item in the j-th position; n is the neutral vote (we set it to 3); α is a half-life parameter that corresponds to the position of the item in the list with a 50% chance of being inspected and rated by the user. In our experiments, users rated all five items of the list, thus α = 5.
Table 6 shows the evaluation results. We conducted a one-way ANOVA to compare the performance of the algorithms. We only computed it on NDCG and Utility because RMSE and MAE are not computed per user, but on the overall set of ratings. We found significance on both metrics: NDCG [F(4, 232) = 4.31; p < 0.003] and Utility [F(4, 232) = 7.58; p < 0.001].
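The utility metric can be sketched as follows, reading the description above as the standard half-life formulation; clipping the gains at the neutral vote is part of that reading, not stated explicitly in the text.

```python
def half_life_utility(ratings, neutral=3, alpha=5):
    """Half-life utility of a ranked list of user ratings.
    ratings[j] is the user's rating of the item in position j + 1; gains
    below the neutral vote are clipped to 0 (our assumption), and the
    weight of a position halves every alpha - 1 steps down the list, so
    the item in position alpha has a 50% chance of being inspected."""
    return sum(max(r - neutral, 0) / 2 ** (j / (alpha - 1))
               for j, r in enumerate(ratings))
```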
We then conducted a post-hoc comparison using a Tukey HSD test. We found that STAGES has the best NDCG, with significant results compared to CBF (p < 0.05) and CBF-STAGES (p < 0.003). Regarding Utility, the best-performing model is again STAGES, which obtained significant results compared to CBF (p < 0.01) and CBF-STAGES (p < 0.001). As far as the minimization of the error in rating estimation is concerned, the best-performing algorithm is FEATURES.
The last column of Table 6 reports the number of opting-outs ("I don't know" ratings) in the evaluation of homes. This phenomenon was most frequent with CBF (10 occurrences), followed by STAGES (9 occurrences) and FEATURES (6 occurrences). Differently, CBF-STAGES and FEATURES-STAGES, which also show item reviews, did not get any "I don't know" evaluations.
[Table 7: Post-task questionnaire results for each recommender system. For each mean value, the asterisks denote the statistical significance of the difference between the best-performing algorithm and the other ones. Significance levels: (***) p < 0.001, (**) p < 0.05.]

Table 7 shows the results of the post-task questionnaire for each of the tested algorithms, grouped by user experience constructs. A one-way ANOVA analysis comparing user experience across the recommendation algorithms showed significance on all the constructs.

2) Structural Equation Model analysis
We performed a Structural Equation Model analysis [52] to gain a deeper understanding of the user experience with the five recommenders. This analysis is useful to find relationships between unobserved constructs (latent variables) by leveraging observable variables. It is difficult to define measures that perfectly represent user experience with an intelligent system that includes a recommendation algorithm. However, we can use different elements, defined by statements, to measure user experience and group them into constructs to find their relations with the algorithms.
Based on the post-task questionnaire, we associated two constructs (Perceived User-Awareness Support and Perceived Quality of Recommendations) with Decision-making Support (DS) aspects and one construct (Interface Adequacy) with User Interface aspects, and we tested five Algorithms (ALG) represented as dummy variables (CBF, FEATURES, STAGES, CBF-STAGES, and FEATURES-STAGES in Figure 6). These constructs are good candidates for a Structural Equation Model because they each include at least three statements.
We performed a Confirmatory Factor Analysis to check the validity of the constructs. This analysis requires: 1) The computation of the convergent validity, which checks that the statements of each construct are related. For this purpose, we examined the Average Variance Extracted (AVE) of each construct, which must exceed 0.50.
2) The computation of the discriminant validity, which checks that statements belonging to different constructs are not related. In this case, the square root of each construct's AVE must be greater than its correlations with the other constructs. All the constructs we defined respected the required constraints.
Figure 6 shows the Structural Equation Model with dependencies, β-coefficients, and standard errors that indicate the correlations between the constructs. The Interface Adequacy has a positive effect (+0.938; p < 0.001) on the Perceived User-Awareness Support. This can be explained by the fact that the user-awareness support given by the system is influenced by how items are presented. Moreover, there is a positive correlation (+1.306; p < 0.001) between the Perceived User-Awareness Support and the Perceived Quality of Recommendations. This suggests that, when users feel that they have enough information about items, they perceive the suggestions as having higher quality.
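Both validity checks reduce to simple arithmetic on the standardized factor loadings and the inter-construct correlations; a minimal sketch (the loading and correlation values in the usage note below are hypothetical, not taken from the study):

```python
import math

def ave(loadings):
    """Average Variance Extracted from standardized factor loadings."""
    return sum(l * l for l in loadings) / len(loadings)

def convergent_ok(loadings):
    # Convergent validity: AVE of the construct must exceed 0.50.
    return ave(loadings) > 0.50

def discriminant_ok(loadings, other_correlations):
    # Fornell-Larcker criterion: the square root of the construct's AVE
    # must exceed every correlation with the other constructs.
    return math.sqrt(ave(loadings)) > max(other_correlations)
```

For instance, hypothetical loadings of (0.8, 0.7, 0.9) give AVE ≈ 0.65, passing both checks against inter-construct correlations of 0.5 and 0.6.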
All the algorithms, except for STAGES, positively affect the Perceived User-Awareness Support. This suggests that consumer feedback alone is not enough to choose a home for rent. Indeed, consumer feedback does not guarantee that the home has the amenities that the user needs. It is worth noticing that FEATURES-STAGES shows the largest correlation value with Perceived User-Awareness Support (+0.474; p < 0.001). We can explain this with the fact that, by explicitly listing the offered amenities, the overview of consumer feedback (bar graph), and the reviews, the algorithm supports decision-making in a complete way. Looking at the Perceived Quality of Recommendations, we observe that all the algorithms, except for CBF-STAGES, have a positive correlation with this aspect. We believe that CBF-STAGES has a negative correlation (-0.219; p < 0.05) because of its low evaluation performance in accuracy, ranking and error estimation; see Section VI-B. Finally, STAGES has the largest correlation (+0.752; p < 0.001) with Perceived Quality of Recommendations. We explain this finding with the fact that consumer feedback is a very useful information source to generate good predictions. Table 8 shows the results of the post-test questionnaire.

3) Post-test results
In-apartment experience and Surroundings emerge as the most important dimensions for decision-making.
The situation of Host and Check-in/Check-out is mixed: several participants consider them important or very important, but a few declare that they are unimportant or of little importance. As far as the visualization of information is concerned, the amenities are considered important more frequently than the bar graphs. This is probably because people want to be sure that the selected homes offer the features they care about.

A. DISCUSSION OF RESULTS
The user study provides interesting findings about the objective and perceived performance of the models we tested, regarding both the recommendation of items and the visualization of results.
The recommendation performance measures show that STAGES, which relies exclusively on the evaluation of user experience in the stages of item fruition, achieves the best results concerning the ranking of items. This finding suggests that the experience evaluation dimensions provide a valuable summary for the identification of relevant items. FEATURES achieves the best results regarding the minimization of the error in rating estimation. However, this is a secondary finding because our primary goal is to promote good items in the recommendation lists.
Notice that the recommender systems that use a single type of information, i.e., either user experience data (STAGES) or item features (CBF and FEATURES), received some opting-outs from participants. Differently, when users interacted with the systems that combine these types of information (CBF-STAGES, FEATURES-STAGES), they were able to evaluate all the suggested items. This is a first indication that the joint presentation of data about item features and user experience enhances users' confidence in item evaluation. FEATURES-STAGES, which combines these two types of information, is the second-best algorithm for rating estimation and obtains fairly good NDCG results. As far as our first research question (RQ1) is concerned, these findings support the hypothesis that the integration of a service-based representation of items with data about their features improves recommendation accuracy. Indeed, we obtain the best results by relying only on service-based information about items (STAGES); however, in that case, some users do not feel confident in decision-making. Thus, a good compromise between recommendation quality and coverage is the integration of consumer experience and item features in the presentation of the suggestions, as done in FEATURES-STAGES. That system achieved the second-best ranking performance and did not get any opting-outs.
To answer RQ2, we analyze participants' perceptions after they interacted with the systems. Regarding the Perceived Quality of Recommendations (Q), people perceived FEATURES as the model that generates the best suggestions. This is probably due to its coherence with the user's requirements. In fact, that system recommends the homes that reflect the amenities marked as important during preference elicitation and highlights them in bold in the presentation of results. However, FEATURES-STAGES is perceived as the best system regarding both Perceived User-Awareness Support (U) and Interface Adequacy (I), which describe users' comprehension of the rationale behind the recommendations, their awareness of the suggestions, and their confidence in decision-making. We explain this finding with the fact that, by showing amenities, bar graphs, and item reviews, the system helps users analyze and compare candidate homes more effectively than by presenting amenity data alone.
The Structural Equation Model confirms these results. The Interface Adequacy (I) has a positive effect on the Perceived User-Awareness Support (U) because, by providing more data about items, the system makes users more confident about the available options to choose from. Moreover, the Perceived User-Awareness Support (U) positively influences the Perceived Quality of Recommendations (Q) because, to perceive the suggestions as good ones, users need a sufficient amount of data about items.
The results of the post-test questionnaire show that participants considered the visualization of data about the amenities offered by the homes as more important than the bar graphs summarizing consumer experience. However, by jointly considering these results and the fact that FEATURES-STAGES is recognized as the algorithm providing the highest user-awareness support, we conclude that both the offered amenities and the data extracted from consumer feedback are key to decision-making.
Given all these findings, we can positively answer research question RQ2: if a recommender system presents both item features and service-based data in the suggestion lists, it enhances users' awareness about the available options, as well as their confidence in decision-making. The reason is that it provides people with complete information to evaluate items from the viewpoint of their features and of the other aspects concerning item fruition.

B. THEORETICAL IMPLICATIONS
This work advances the state of the art in recommender systems, and in particular of review-based and aspect-based ones ( [13], [20], [28], [31], [33], [42], [53]), by integrating service models in rating estimation and in the presentation of results. Review-based recommender systems use consumer experience about items to integrate metadata with aspects extracted from online reviews. However, they extract item-centric data that fail to convey the expected experience at fruition time. Differently, we model these stages and group the aspects extracted from consumer feedback into evaluation dimensions aimed at separately measuring user experience. This approach makes it possible to weight aspects in different ways, depending on the importance of the individual evaluation dimensions to the user. Moreover, it supports the summarization of previous consumers' experience to enhance item evaluation and presentation within a recommendation list. The user study we carried out showed that our approach enhances recommendation performance, user-awareness about the suggested options, and users' confidence in item-selection decisions.
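The dimension-level weighting described above can be pictured as a weighted mean of per-dimension experience scores (a hedged sketch; the dimension names and the weighting scheme are illustrative, not the paper's exact model):

```python
def service_based_score(dimension_scores, user_weights):
    """Combine per-dimension experience scores into one item score,
    weighting each service-stage dimension by its importance to the user.
    Dimensions the user assigns no weight to do not contribute.
    """
    weighted = sum(user_weights.get(dim, 0.0) * score
                   for dim, score in dimension_scores.items())
    total_weight = sum(user_weights.get(dim, 0.0) for dim in dimension_scores)
    return weighted / total_weight if total_weight else 0.0

# e.g., a hypothetical user who weights the host stage twice as much
# as the surroundings:
score = service_based_score({"host": 4.0, "surroundings": 2.0},
                            {"host": 1.0, "surroundings": 0.5})  # ≈ 3.33
```
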

VIII. LIMITATIONS AND FUTURE WORK
The first limitation of our work concerns the number of participants involved in the user test. Even though the power analysis that we conducted suggests that this number is sufficient to obtain a robust statistical evaluation, we plan to test our systems with a larger number of users to increase the statistical power of the experiments. With a larger number of participants, we could develop and test service-based recommendation algorithms based on Collaborative Filtering, such as the multi-criteria ones presented in [14] and [54].
The second limitation concerns the extraction of aspects from the reviews. In this work, we leveraged the method described in [45], which uses dependency parsing to analyze textual information. This is an unsupervised opinion mining technique and does not require a large annotated dataset for training. However, it bases the match between aspects and dimensions on ad hoc dictionaries. We plan to extend our model with semantic Natural Language Processing techniques to extract aspects, and their synonyms, using standard language resources.
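The dictionary-based matching works roughly as follows (a simplified sketch; the lexicon entries and dimension names are hypothetical, and the actual method operates on dependency-parsed review text rather than on pre-extracted aspect-sentiment pairs):

```python
# Hypothetical ad hoc dictionaries mapping aspect terms to
# service-stage evaluation dimensions.
DIMENSION_LEXICON = {
    "check-in": {"keys", "arrival", "check-in", "checkout"},
    "host": {"host", "owner", "communication"},
    "in-apartment": {"bed", "kitchen", "clean", "wifi"},
    "surroundings": {"neighborhood", "metro", "restaurants"},
}

def dimension_scores(aspect_sentiments):
    """Average the sentiment of extracted (aspect, score) pairs
    within each evaluation dimension; unmatched aspects are dropped."""
    totals, counts = {}, {}
    for aspect, score in aspect_sentiments:
        for dim, words in DIMENSION_LEXICON.items():
            if aspect in words:
                totals[dim] = totals.get(dim, 0.0) + score
                counts[dim] = counts.get(dim, 0) + 1
    return {dim: totals[dim] / counts[dim] for dim in totals}
```

A richer lexicon (or the semantic NLP extension mentioned above) would replace the exact-match lookup with synonym-aware matching.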
We also plan to investigate other models to define the service fruition stages and the dimensions for the evaluation of experience that underlie recommendation. So far, we have leveraged the widely used Service Journey Maps. However, other approaches, such as Service Blueprints [55], can be used to develop finer-grained service models. We also plan to test our recommender systems on the sales of experience products to assess their applicability to heterogeneous items. The instantiation in a new application domain is supported by the existence of service models that can be adapted to the peculiarities of the selected domain.

IX. ETHICAL ISSUES
In planning the user study we complied with literature guidelines on controlled experiments [56]. Through the user interface of our test application, participants were informed about their rights:
• the right to stop participating in the experiment, possibly without giving a reason;
• the right to obtain further information about the purpose and the outcomes of the experiment;
• the right to have their data anonymized.
As described in Section V, before starting the experiment, participants were asked to: (i) read a consent form stating the nature of the experiment and their rights, (ii) confirm that they had read and understood their rights by clicking on the user interface of the test application, and (iii) confirm that they were 18 years old or over. Every participant was given the same instructions before the experimental tasks.
We did not store participants' names. During the user study, and the analysis of its results, we worked with anonymous codes.

X. CONCLUSIONS
In this paper, we pointed out that current recommender systems use item-centric data to estimate ratings and to present their results. Even though review-based recommender systems extract aspects from consumer feedback, they overlook the user experience during all the stages of item fruition, which is key to decision-making.
In order to address this limitation, we investigated the integration of recommender systems with service modeling to explicitly represent the evaluation dimensions of consumer experience during the stages of item fruition. Building on existing analyses of user experience with items, we developed different recommendation models that employ item features, experience evaluation dimensions, and their combination to recommend and holistically present items to the user. The novelty of our approach is that we group the aspects of items extracted from reviews based on these evaluation dimensions, around which we organize preference modeling, recommendation, and information presentation. This enables us to steer the suggestions toward the user's preferences for all such dimensions, and to summarize the user experience with items, enhancing the identification of relevant options within the recommendation lists. In a user study, we found that, compared to state-of-the-art recommender systems, our approach enhances recommendation performance, user-awareness about items, and confidence in decision-making. These findings encourage the adoption of service-based models in recommender systems research.