
SECTION I

INTRODUCTION

Big data has emerged as a widely recognized trend, attracting attention from government, industry, and academia [1]. Generally speaking, Big Data concerns large-volume, complex, growing data sets with multiple, autonomous sources. Big Data applications, in which data collection has grown so tremendously that it is beyond the ability of commonly used software tools to capture, manage, and process within a "tolerable elapsed time," are on the rise [2]. The most fundamental challenge for Big Data applications is to explore the large volumes of data and extract useful information or knowledge for future actions [3].

With the prevalence of service computing and cloud computing, more and more services are deployed in cloud infrastructures to provide rich functionalities [4]. Service users nowadays encounter unprecedented difficulties in finding ideal ones among an overwhelming number of services. Recommender systems (RSs) are techniques and intelligent applications that assist users in a decision-making process where they want to choose some items among a potentially overwhelming set of alternative products or services. Collaborative filtering (CF) techniques such as item-based and user-based methods are the dominant techniques applied in RSs [5]. The basic assumption of user-based CF is that people who agreed in the past tend to agree again in the future. Unlike user-based CF, the item-based CF algorithm recommends to a user the items that are similar to what he/she has preferred before [6]. Although traditional CF techniques are sound and have been successfully applied in many e-commerce RSs, they encounter two main challenges for Big Data applications: 1) making decisions within an acceptable time; and 2) generating ideal recommendations from so many services. Concretely, as a critical step in traditional CF algorithms, computing the similarity between every pair of users or services may take too much time, and may even exceed the processing capability of current RSs. Consequently, service recommendation based on similar users or similar services would either lose its timeliness or could not be done at all. In addition, all services are considered when computing services' rating similarities in traditional CF algorithms, while most of them are dissimilar to the target service. The ratings of these dissimilar services may affect the accuracy of the predicted rating.

A naïve solution is to decrease the number of services that need to be processed in real time. Clustering is such a technique: it can reduce the data size by a large factor by grouping similar services together. Therefore, we propose a Clustering-based Collaborative Filtering approach (ClubCF), which consists of two stages: clustering and collaborative filtering. Clustering is a preprocessing step to separate big data into manageable parts [7]. A cluster contains some similar services, just as a club contains some like-minded users; this is another reason, besides abbreviation, that we call this approach ClubCF. Since the number of services in a cluster is much smaller than the total number of services, the computation time of the CF algorithm can be reduced significantly. Besides, since the ratings of similar services within a cluster are more relevant than those of dissimilar services [8], the recommendation accuracy based on users' ratings may be enhanced.

The rest of this paper is organized as follows. In Section II, a service Bigtable is designed for the storage requirements of ClubCF. It employs Bigtable and is capable of storing service-relevant big data in a distributed and scalable manner. In Section III, the ClubCF approach is described in detail step by step. First, characteristic similarities between services are computed as a weighted sum of description similarities and functionality similarities. Then, services are merged into clusters according to their characteristic similarities. Next, an item-based CF algorithm is applied within the cluster that the target service belongs to. In Section IV, several experiments are conducted on a real dataset extracted from ProgrammableWeb (http://www.programmableweb.com). Related work is analyzed in Section V. Finally, we draw some conclusions and present some future work in Section VI.

SECTION II

PRELIMINARY KNOWLEDGE

To measure the similarity between Web services, Liu et al. [9] investigated the metadata from WSDL (Web Service Description Language) files and defined a Web service as $\langle N, M, D, O\rangle$, where $N$ is the name that specifies a Web service, $M$ is the set of messages exchanged by the operation invocation, $D$ is the set of data types, and $O$ is the set of operations provided by the Web service. From this definition, three types of metadata from WSDL can be identified for similarity matching: the plain textual descriptions, the operations that capture the purposed functionality, and the data types related to the semantic meanings. For evaluating the reputation of a service, Li et al. [10] defined a Web service as $WS(id, d, t, sg, rs, dor)$, where $id$ is its identity, $d$ is its text description, $t$ is its classification, $sg$ denotes the level of its transaction volume, $rs$ is its review set, and $dor$ is its reputation degree. In the SOA Solution Stack (S3) [11] proposed by IBM, a service is defined as an abstract specification of one or more business-aligned IT functions. This specification provides consumers with sufficient information to be able to invoke the business functions exposed by a service provider.

Although the definitions of service are distinct and application-specific, they have common elements, which mainly include service descriptions and service functionalities. In addition, rating is an important user activity that reflects users' opinions on services; especially in service recommendation applications, the service rating is an important element. As more and more services emerge on the Internet, a huge volume of service-relevant elements is generated and distributed across the network, which cannot be effectively accessed by traditional database management systems. To address this problem, Bigtable is used to store services in this paper. Bigtable [12] is a distributed storage system of Google for managing structured data that is designed to scale to a very large size across thousands of commodity servers. A Bigtable is a sparse, distributed, persistent multi-dimensional sorted map. The map is indexed by a row key, a column key, and a timestamp; each value in the map is an uninterpreted array of bytes. Column keys are grouped into sets called column families, which form the basic unit of access control. A column key is named using the syntax family:qualifier, where 'family' refers to the column family and 'qualifier' refers to the column key. Each cell in a Bigtable can contain multiple versions of the same data, which are indexed by timestamp. Different versions of a cell are stored in decreasing timestamp order, so that the most recent versions can be read first.
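The three-level map described above (row key, family:qualifier column key, timestamp) can be illustrated with a minimal in-memory sketch. All class and method names here are illustrative, not Google's actual Bigtable API:

```python
from collections import defaultdict

# Minimal in-memory sketch of the Bigtable data model described above:
# a map indexed by (row key, "family:qualifier" column key, timestamp).
class SketchBigtable:
    def __init__(self):
        # row -> column key -> list of (timestamp, value), newest first
        self.rows = defaultdict(lambda: defaultdict(list))

    def put(self, row_key, family, qualifier, timestamp, value):
        cell = self.rows[row_key][f"{family}:{qualifier}"]
        cell.append((timestamp, value))
        # keep versions in decreasing timestamp order, most recent first
        cell.sort(key=lambda tv: tv[0], reverse=True)

    def get(self, row_key, family, qualifier):
        # return the most recent version of the cell, if any
        cell = self.rows[row_key].get(f"{family}:{qualifier}")
        return cell[0][1] if cell else None

table = SketchBigtable()
table.put("s1", "Rating", "u1", 5, "3")
table.put("s1", "Rating", "u1", 6, "4")  # later version shadows the earlier one
print(table.get("s1", "Rating", "u1"))  # most recent rating: "4"
```

Storing versions newest-first mirrors Bigtable's decreasing-timestamp order, so the most recent rating is read first.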

In this paper, all services are stored in a Bigtable, which is called the service Bigtable. The corresponding elements are drawn from the service Bigtable during the process of ClubCF. Formally, the service Bigtable is defined as follows.

Definition 1:

A service Bigtable is defined as a table expressed in the format of $<\!Service\_ID\!>\ <\!Timestamp\!>$
$$\begin{align*} \{&<\!Description\!>:[<d_{1}>,<d_{2}>,\ldots];\\ &<\!Functionality\!>:[<f_{1}>,<f_{2}>,\ldots];\\ &<\!Rating\!>:[<u_{1}>,<u_{2}>,\ldots]\} \end{align*}$$
The elements in the expression are specified as follows:

  1. $Service\_ID$ is the row key that uniquely identifies a service.
  2. $Timestamp$ identifies the time when the record is written into the service Bigtable.
  3. $Description$, $Functionality$, and $Rating$ are three column families.
  4. The identifier of a description word, e.g., $d_{1}$ and $d_{2}$, is used as a qualifier of $Description$.
  5. The identifier of a functionality, e.g., $f_{1}$ and $f_{2}$, is used as a qualifier of $Functionality$.
  6. The identifier of a user, e.g., $u_{1}$ and $u_{2}$, is used as a qualifier of $Rating$.

A slice of the service Bigtable is illustrated in Table I. The row key is $s_{1}$. The Description column family contains the words describing $s_{1}$, e.g., "driving". The Functionality column family contains the service functionalities, e.g., "Google maps". And the Rating column family contains the ratings given by some users at different times, e.g., "4" is the rating that "$u_{1}$" gave to "$s_{1}$" at timestamp "$t_{6}$".

Table 1. A slice of the service Bigtable.
SECTION III

A CLUSTERING-BASED COLLABORATIVE FILTERING APPROACH (CLUBCF) FOR BIG DATA APPLICATION

According to Definition 1, a service can be expressed as a triple $s=(D, F, R)$, where $D$ is a set of words describing $s$, $F$ is a set of functionalities of $s$, and $R$ is a set of ratings that some users gave to $s$. Five kinds of service similarity are computed based on $D$, $F$, and $R$ during the process of ClubCF, which are defined as follows.

Definition 2:

Suppose $s_{t}=\langle D_{t},F_{t},R_{t}\rangle$ and $s_{j}=\langle D_{j},F_{j},R_{j}\rangle$ are two services. The similarity between $s_{t}$ and $s_{j}$ is considered in five dimensions: description similarity $D\_sim(s_{t},s_{j})$, functionality similarity $F\_sim(s_{t},s_{j})$, characteristic similarity $C\_sim(s_{t},s_{j})$, rating similarity $R\_sim(s_{t},s_{j})$, and enhanced rating similarity $R\_sim'(s_{t},s_{j})$.

With this assumption, a ClubCF approach for Big Data applications is presented, which aims at recommending services from overwhelming candidates within an acceptable time. Technically, ClubCF consists of two interdependent stages, i.e., the clustering stage and the collaborative filtering stage. In the first stage, services are clustered according to their characteristic similarities. In the second stage, a collaborative filtering algorithm is applied within the cluster that the target service belongs to.

Concretely, Fig. 1 depicts the specification of the ClubCF approach step by step.

Figure 1. Specification of the ClubCF Approach.

A. Deployment of Clustering Stage

Step 1.1: Stem Words

Different developers may use different word forms to describe similar services. Using these words directly may distort the measurement of description similarity. Therefore, description words should be normalized before further use. In fact, morphologically similar words are clubbed together under the assumption that they are also semantically similar. For example, "map", "maps", and "mapping" are forms of the equivalent lexeme, with "map" as the morphological root form. To transform variant word forms to their common root, called the stem, various stemming algorithms have been proposed, such as the Lovins stemmer, the Dawson stemmer, the Paice/Husk stemmer, and the Porter stemmer [13]. Among them, the Porter Stemmer (http://tartarus.org/martin/PorterStemmer/) is one of the most widely used stemming algorithms. It applies cascaded rewrite rules that can be run very quickly and do not require the use of a lexicon [14].
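The effect of stemming can be seen with a toy suffix-stripping function. This is only an illustration of the idea; the real Porter Stemmer applies cascaded rewrite rules and handles far more cases:

```python
# Toy suffix-stripping stemmer for illustration only; NOT the Porter
# algorithm. It strips a few common suffixes and undoubles a trailing
# consonant so that "mapping" -> "map".
def toy_stem(word):
    word = word.lower()
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # undouble a trailing consonant, e.g. "mapp" -> "map"
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return word

print([toy_stem(w) for w in ["map", "maps", "mapping"]])  # all reduce to 'map'
```

All three word forms collapse to the common stem "map", which is exactly the normalization needed before comparing description sets.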

In the ClubCF approach, the words in $D_{t}$ are retrieved from the service Bigtable where row key = "$s_{t}$" and column family = "Description". The words in $D_{j}$ are retrieved from the service Bigtable where row key = "$s_{j}$" and column family = "Description". These words are then stemmed by the Porter Stemmer and put into $D_{t}'$ and $D_{j}'$, respectively.

Step 1.2: Compute Description Similarity and Functionality Similarity

Description similarity and functionality similarity are both computed by the Jaccard similarity coefficient (JSC), a statistical measure of similarity between sample sets [15]. For two sets, the JSC is defined as the cardinality of their intersection divided by the cardinality of their union. Concretely, the description similarity between $s_{t}$ and $s_{j}$ is computed by formula (1):
$$D\_sim(s_{t},s_{j})=\frac{\left|D_{t}'\cap D_{j}'\right|}{\left|D_{t}'\cup D_{j}'\right|}$$

It can be inferred from this formula that the larger $\left|D_{t}'\cap D_{j}'\right|$ is, the more similar the two services are. $\left|D_{t}'\cup D_{j}'\right|$ is the scaling factor, which ensures that the description similarity is between 0 and 1.

The functionalities in $F_{t}$ are retrieved from the service Bigtable where row key = "$s_{t}$" and column family = "Functionality". The functionalities in $F_{j}$ are retrieved from the service Bigtable where row key = "$s_{j}$" and column family = "Functionality". Then, the functionality similarity between $s_{t}$ and $s_{j}$ is computed using the JSC as formula (2):
$$F\_sim(s_{t},s_{j})=\frac{\left|F_{t}\cap F_{j}\right|}{\left|F_{t}\cup F_{j}\right|}$$
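Formulas (1) and (2) are the same set operation applied to different inputs, so a single JSC helper covers both. The example sets below are hypothetical stemmed description sets, not taken from the paper's dataset:

```python
# Jaccard similarity coefficient (JSC), as in formulas (1) and (2):
# |A ∩ B| / |A ∪ B| over two finite sets.
def jaccard(a, b):
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical stemmed description sets for two services
d_t = {"drive", "map", "travel"}
d_j = {"drive", "book", "travel"}
print(jaccard(d_t, d_j))  # 2 shared stems / 4 distinct stems = 0.5
```

The same call computes $F\_sim$ when given the API sets $F_t$ and $F_j$ instead of the stemmed description sets.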

Step 1.3: Compute Characteristic Similarity

The characteristic similarity between $s_{t}$ and $s_{j}$ is computed as a weighted sum of description similarity and functionality similarity, as in formula (3):
$$C\_sim(s_{t},s_{j})=\alpha\times D\_sim(s_{t},s_{j})+\beta\times F\_sim(s_{t},s_{j})$$

In this formula, $\alpha\in[0,1]$ is the weight of description similarity, $\beta\in[0,1]$ is the weight of functionality similarity, and $\alpha+\beta=1$. The weights express the relative importance of the two similarities.
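Since $\beta = 1-\alpha$, formula (3) reduces to a one-line convex combination. The sample values $1/6$ and $1.0$ are illustrative inputs:

```python
# Characteristic similarity (formula (3)) as a convex combination,
# with beta = 1 - alpha so the two weights sum to 1.
def c_sim(d_sim, f_sim, alpha=0.5):
    return alpha * d_sim + (1 - alpha) * f_sim

print(round(c_sim(1 / 6, 1.0, alpha=0.5), 3))  # 0.583
```

With $\alpha=0.5$ both similarities contribute equally; moving $\alpha$ toward 1 favors the description similarity.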

Provided the number of services in the recommender system is $n$, the characteristic similarities of every pair of services are calculated and form an $n\times n$ characteristic similarity matrix $D$. An entry $d_{t,j}$ in $D$ represents the characteristic similarity between $s_{t}$ and $s_{j}$.

Step 1.4: Cluster Services

Clustering is a critical step in our approach. Clustering methods partition a set of objects into clusters such that objects in the same cluster are more similar to each other than objects in different clusters according to some defined criteria.

Generally, cluster analysis algorithms have been utilized where huge data are stored [16]. Clustering algorithms can be either hierarchical or partitional. Some standard partitional approaches (e.g., $K$-means) suffer from several limitations: 1) results depend strongly on the choice of the number of clusters $K$, and the correct value of $K$ is initially unknown; 2) cluster size is not monitored during execution of the $K$-means algorithm, so some clusters may become empty ("collapse"), which causes premature termination of the algorithm; and 3) the algorithms converge to a local minimum [17]. Hierarchical clustering methods can be further classified into agglomerative or divisive, depending on whether the clustering hierarchy is formed in a bottom-up or top-down fashion. Many current state-of-the-art clustering systems exploit agglomerative hierarchical clustering (AHC) as their clustering strategy, due to its simple processing structure and acceptable level of performance. Furthermore, it does not require the number of clusters as input. Therefore, we use an AHC algorithm [18], [19] for service clustering, as follows.

Assume there are $n$ services. Each service is initialized to be a cluster of its own. At each reduction step, the two most similar clusters are merged until only $K$ ($K<n$) clusters remain.

Algorithm 1 AHC algorithm for service clustering


B. Deployment of Collaborative Filtering Stage

To date, item-based collaborative filtering algorithms have been widely used in many real-world applications, such as at Amazon.com. The approach can be divided into three main steps: computing rating similarities, selecting neighbors, and recommending services.

Step 2.1: Compute Rating Similarity

Rating similarity computation between items is a time-consuming but critical step in item-based CF algorithms. Common rating similarity measures include the Pearson correlation coefficient (PCC) [20] and the cosine similarity between rating vectors. The basic intuition behind the PCC measure is to give a high similarity score to two items that tend to be rated the same by many users. PCC, which is the preferred choice in most major systems, was found to perform better than cosine vector similarity [21]. Therefore, PCC is applied to compute the rating similarity between each pair of services in ClubCF. Provided that services $s_{t}$ and $s_{j}$ both belong to the same cluster, the PCC-based rating similarity [22] between $s_{t}$ and $s_{j}$ is computed by formula (4):
$$R\_sim(s_{t},s_{j})=\frac{\sum_{u_{i}\in U_{t}\cap U_{j}}\left(r_{u_{i},s_{t}}-\overline{r}_{s_{t}}\right)\left(r_{u_{i},s_{j}}-\overline{r}_{s_{j}}\right)}{\sqrt{\sum_{u_{i}\in U_{t}\cap U_{j}}\left(r_{u_{i},s_{t}}-\overline{r}_{s_{t}}\right)^{2}}\sqrt{\sum_{u_{i}\in U_{t}\cap U_{j}}\left(r_{u_{i},s_{j}}-\overline{r}_{s_{j}}\right)^{2}}}$$
Here, $U_{t}$ is the set of users who rated $s_{t}$, while $U_{j}$ is the set of users who rated $s_{j}$; $u_{i}$ is a user who rated both $s_{t}$ and $s_{j}$; $r_{u_{i},s_{t}}$ is the rating of $s_{t}$ given by $u_{i}$, retrieved from the service Bigtable where row key = "$s_{t}$" and column key = "$Rating:u_{i}$"; $r_{u_{i},s_{j}}$ is the rating of $s_{j}$ given by $u_{i}$, retrieved from the service Bigtable where row key = "$s_{j}$" and column key = "$Rating:u_{i}$"; $\overline{r}_{s_{t}}$ is the average rating of $s_{t}$; and $\overline{r}_{s_{j}}$ is the average rating of $s_{j}$. It should be noted that if the denominator of formula (4) is zero, we set the similarity to 0 in order to avoid division by zero.
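A minimal sketch of formula (4), representing each service's ratings as a user-to-rating dict. One interpretation choice is made here: the mean $\overline{r}_{s}$ is taken over all of a service's ratings (the text does not restrict it to co-raters). The sample ratings are invented for illustration:

```python
import math

# PCC-based rating similarity (formula (4)) between two services,
# summed over users who rated both. ratings_* map user id -> rating.
def r_sim(ratings_t, ratings_j):
    common = set(ratings_t) & set(ratings_j)
    if not common:
        return 0.0
    # assumption: average over ALL raters of each service
    mean_t = sum(ratings_t.values()) / len(ratings_t)
    mean_j = sum(ratings_j.values()) / len(ratings_j)
    num = sum((ratings_t[u] - mean_t) * (ratings_j[u] - mean_j) for u in common)
    den = math.sqrt(sum((ratings_t[u] - mean_t) ** 2 for u in common)) * \
          math.sqrt(sum((ratings_j[u] - mean_j) ** 2 for u in common))
    return num / den if den else 0.0  # zero denominator -> 0, as in the text

rt = {"u1": 4, "u2": 2, "u3": 5}  # hypothetical ratings of s_t
rj = {"u1": 5, "u2": 1, "u3": 4}  # hypothetical ratings of s_j
print(round(r_sim(rt, rj), 3))
```

Both services are rated above their mean by the same users, so the result is close to +1, reflecting strongly correlated rating behavior.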

Although PCC can provide accurate similarity computation, it may overestimate the rating similarities when there is a small number of co-rated services. To address this problem, the enhanced rating similarity [23] between $s_{t}$ and $s_{j}$ is computed by formula (5):
$$R\_sim'(s_{t},s_{j})=\frac{2\times\left|U_{t}\cap U_{j}\right|}{\left|U_{t}\right|+\left|U_{j}\right|}\times R\_sim(s_{t},s_{j})$$

In this formula, $\left|U_{t}\cap U_{j}\right|$ is the number of users who rated both $s_{t}$ and $s_{j}$, and $\left|U_{t}\right|$ and $\left|U_{j}\right|$ are the numbers of users who rated $s_{t}$ and $s_{j}$, respectively. When the number of co-raters is small, the weight $\frac{2\times\left|U_{t}\cap U_{j}\right|}{\left|U_{t}\right|+\left|U_{j}\right|}$ decreases the rating similarity estimation between the two services. Since the value of $\frac{2\times\left|U_{t}\cap U_{j}\right|}{\left|U_{t}\right|+\left|U_{j}\right|}$ lies in the interval [0, 1] and the value of $R\_sim(s_{t},s_{j})$ lies in the interval [$-1$, 1], the value of $R\_sim'(s_{t},s_{j})$ also lies in the interval [$-1$, 1].
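The damping effect of formula (5) can be seen directly. The rater sets and PCC value below are hypothetical:

```python
# Enhanced rating similarity (formula (5)): scales the PCC value down
# when the two services share few co-raters relative to their rater counts.
def enhanced_r_sim(users_t, users_j, pcc):
    users_t, users_j = set(users_t), set(users_j)
    if not users_t or not users_j:
        return 0.0
    weight = 2 * len(users_t & users_j) / (len(users_t) + len(users_j))
    return weight * pcc

# 2 co-raters out of 4 and 6 raters -> weight 2*2/(4+6) = 0.4
result = enhanced_r_sim({"u1", "u2", "u3", "u4"},
                        {"u1", "u2", "u5", "u6", "u7", "u8"}, 0.9)
print(result)
```

A raw PCC of 0.9 computed from only 2 shared raters is reduced to 0.36, which reflects the lower confidence in that estimate.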

Step 2.2: Select Neighbors

Based on the enhanced rating similarities between services, the neighbors of a target service $s_{t}$ are determined according to constraint formula (6):
$$N(s_{t})=\left\{s_{j}\,\middle|\,R\_sim'(s_{t},s_{j})>\gamma,\ s_{t}\ne s_{j}\right\}$$
Here, $R\_sim'(s_{t},s_{j})$ is the enhanced rating similarity between services $s_{t}$ and $s_{j}$ computed by formula (5), and $\gamma$ is a rating similarity threshold. The larger the value of $\gamma$, the fewer neighbors are chosen, but they may be more similar to the target service; thus the coverage of collaborative filtering will decrease but the accuracy may increase. Conversely, the smaller the value of $\gamma$, the more neighbors are chosen, but some of them may be only slightly similar to the target service; thus the coverage of collaborative filtering will increase but the accuracy may decrease. Therefore, a suitable $\gamma$ should be set as a tradeoff between accuracy and coverage. Once $\gamma$ is assigned, $s_{j}$ is selected as a neighbor of $s_{t}$ and put into the neighbor set $N(s_{t})$ if $R\_sim'(s_{t},s_{j})>\gamma$.

Step 2.3: Compute Predicted Rating

For an active user $u_{a}$ for whom predictions are being made, whether a target service $s_{t}$ is worth recommending depends on its predicted rating. If $N(s_{t})\ne\varnothing$, similar to the computation formula proposed by Wu et al. [24], the predicted rating $P_{u_{a},s_{t}}$ in item-based CF is computed as follows:
$$P_{u_{a},s_{t}}=\overline{r}_{s_{t}}+\frac{\sum_{s_{j}\in N(s_{t})}\left(r_{u_{a},s_{j}}-\overline{r}_{s_{j}}\right)\times R\_sim'(s_{t},s_{j})}{\sum_{s_{j}\in N(s_{t})}R\_sim'(s_{t},s_{j})}$$
Here, $\overline{r}_{s_{t}}$ is the average rating of $s_{t}$, $N(s_{t})$ is the neighbor set of $s_{t}$, $s_{j}\in N(s_{t})$ denotes that $s_{j}$ is a neighbor of the target service $s_{t}$, $r_{u_{a},s_{j}}$ is the rating that the active user $u_{a}$ gave to $s_{j}$, $\overline{r}_{s_{j}}$ is the average rating of $s_{j}$, and $R\_sim'(s_{t},s_{j})$ is the enhanced rating similarity between services $s_{t}$ and $s_{j}$ computed using formula (5).
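The prediction formula above can be sketched as a small function. The `neighbor_info` structure of $(r_{u_a,s_j}, \overline{r}_{s_j}, R\_sim')$ triples is a convenience of this sketch, not the paper's notation, and the sample numbers are invented:

```python
# Predicted rating of target service s_t for active user u_a:
# mean rating of s_t plus the similarity-weighted deviation of u_a's
# neighbor ratings from each neighbor's own mean rating.
def predict(mean_t, neighbor_info):
    # neighbor_info: list of (r_ua_sj, mean_sj, sim) triples
    num = sum((r - mean_j) * sim for r, mean_j, sim in neighbor_info)
    den = sum(sim for _, _, sim in neighbor_info)
    return mean_t + num / den if den else mean_t

# One neighbor rated 4 against a mean of 3.0, with similarity 0.8
print(predict(3.5, [(4, 3.0, 0.8)]))  # 3.5 + (4 - 3.0) = 4.5
```

Because $u_a$ rated the neighbor one point above that neighbor's average, the target's predicted rating is pushed one point above its own average.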

If the predicted rating of a service exceeds a recommending threshold, it becomes a recommendable service for the active user. A service is generally rated on a five-point scale from 1 (very dissatisfied) to 5 (very satisfied). Therefore, we set the recommending threshold to 2.5, which is half of the maximum rating. All recommendable services are ranked in non-ascending order of their predicted ratings so that users may discover valuable services quickly.

C. Time Complexity Analysis

The time complexity of ClubCF can be divided into two parts: 1) the offline cluster building; and 2) the online collaborative filtering.

There are two main computationally expensive steps in the AHC algorithm. The first is the computation of the pairwise similarities between all services. Provided the number of services in the recommender system is $n$, the complexity of this step is generally $O(n^{2})$. The second is the repeated selection of the pair of most similar clusters, or the pair of clusters whose merge best optimizes the criterion function. A naive way of performing this step is to re-compute the gains achieved by merging each pair of clusters after each level of the agglomeration and select the most promising pair. During the $l$th agglomeration step, this requires $O((n-l)^{2})$ time, leading to an overall complexity of $O(n^{3})$. Fortunately, if the priority queue is implemented using a binary heap, the total complexity of the delete and insert operations is $O((n-l)\log(n-l))$, and the overall complexity over the $n-1$ agglomeration steps is $O(n^{2}\log n)$ [25].

Suppose there are $m$ users and $n$ services. The relationship between users and services is denoted by an $m\times n$ matrix, where each entry $r_{i,j}$ represents the rating of user $u_{i}$ on service $s_{j}$. The time complexity of PCC-based item similarity measures is then $O(nm^{2})$ [26]. Fortunately, the number of services in a cluster is much smaller than the whole number of services. Suppose the number of services in a cluster $C_{k}$ is $n_{k}$ and the number of users who rated at least one service in $C_{k}$ is $m_{k}$; then the time complexity of similarity computation is $O(n_{k}m_{k}^{2})$. If the number of the target service's neighbors reaches its maximum value, the worst-case time complexity of item-based prediction is $O(n_{k})$. Since $n_{k}\ll n$ and $m_{k}\ll m$, the computation cost of ClubCF may decrease significantly. From the analysis above, it can be inferred that ClubCF may meet the demand of real-time recommendation to some extent.

SECTION IV

EXPERIMENTS AND EVALUATION

A. Experimental Background

To verify ClubCF, a mashup dataset is used in the experiments. Mashup is an ad hoc composition technology for Web applications that allows users to draw upon content retrieved from external data sources to create value-added services [27]. Compared with traditional "developer-centric" composition technologies, e.g., BPEL (Business Process Execution Language) and WSCI (Web Service Choreography Interface), mashup provides a flexible and easy-to-use way of composing services on the web [28]. Recently, "mashup" has become one of the hottest buzzwords in the area of web applications, and many companies and institutions provide various mashup solutions or re-label existing integration solutions as mashup tools. For example, HousingMaps (http://www.housingmaps.com) combines property listings from Craigslist (http://www.craigslist.org/) with map data from Google Maps (http://maps.google.com/) in order to assist people moving from one city to another in searching for housing offers. More interesting mashup services include Zillow (http://www.zillow.com/) and SkiBonk (http://www.skibonk.com/).

Manual mashup development requires programming skills and remains an intricate and time-consuming task, which prevents the average user from programming his or her own mashup services. To enable even inexperienced end users to mash up their web services, many mashup-specific development tools and frameworks have emerged [29]. Representative end-user mashup tools include Google Mashup Editor, Yahoo Pipes, Microsoft Popfly, Intel Mash Maker, and IBM's QEDWiki [30]. These tools speed up the overall mashup development process, resulting in an explosion in the number of mashup services available on the Internet. Meanwhile, a large number of mashup services are similar to each other, both in their components and in their logic [31]. Over such mashup-oriented big data, ClubCF is a suitable approach for recommending ideal mashup services to users.

The data for our experiments was collected from ProgrammableWeb, a popular online community built around user-generated mashup services, which provides the most characteristic collection [32]. The extracted data was used to produce datasets for the population of mashup services. The dataset included the mashup service name, tags, and APIs used. As of Dec. 2012, 6,225 mashup services and related information had been crawled from this site. These services are labeled with 20,936 tags, among which 1,822 are distinct, and use 15,450 APIs, among which 1,499 are distinct in name. The tags were stemmed using the Porter Stemmer algorithm, yielding 1,608 different tag stems.

Since very few ratings are available at present, we generate pseudorandom integers in the range 0 to 5 as the ratings of mashup services. Assume there are 500 users that have rated some mashup services published on the website. The user-item matrix then consists of 500 rows and 6,225 columns. In total, 50,000 non-zero ratings are generated. The sparsity level of the matrix is 98.39% (sparsity level = 1 − 50,000/(500 × 6,225) ≈ 0.9839).

We add an empirical evaluation based on a well-known statistical test, namely $l$-fold cross validation [33]. The rating records are split into $l$ mutually exclusive subsets (the folds) of equal size. During each step, the model is tested on one fold and trained on the rest. The cross-validation process is repeated $l$ times, with each of the $l$ subsets used exactly once as the validation data. In this paper, 5-fold cross validation is applied (i.e., $l=5$). In order to distribute test data and training data over all clusters, 20% of the services of each cluster were included in the test data and 80% in the training data for each data split.
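The $l$-fold split described above can be sketched as follows; the record ids, seed, and helper name are illustrative, and the per-cluster 20/80 stratification is omitted for brevity:

```python
import random

# Sketch of l-fold cross validation: records are shuffled once and
# partitioned into l mutually exclusive folds of (near-)equal size;
# each fold serves as the test set exactly once.
def k_fold_splits(records, l=5, seed=0):
    records = list(records)
    random.Random(seed).shuffle(records)
    folds = [records[i::l] for i in range(l)]
    for i in range(l):
        test = folds[i]
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        yield train, test

splits = list(k_fold_splits(range(10), l=5))
print([len(test) for _, test in splits])  # each of the 5 folds holds 2 records
```

Every record appears in exactly one test fold across the $l$ iterations, which is what makes the folds mutually exclusive.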

B. Experimental Environments

The experiments are conducted in a hybrid network environment that consists of three local hosts. The mashup services used in the experiment are distributed among the three hosts. As listed in Table II, the cluster (HDFS: http://114.212.190.91:50070, and JobTracker: http://114.212.190.91:50030, Campus Network) consists of 18 nodes, with one master node and 17 slave nodes. Each node is equipped with two Intel(R) Quad Core E5620 Xeon(R) processors at 2.4GHz and 24GB RAM. The master node mounts a 2TB disk, while each slave node is equipped with two 2TB disks. The cluster runs under Redhat Enterprise Linux Server 6.0, Java 1.6.0, and Hadoop-0.20.205.0. The private cloud (http://cs-cloud.nju.edu.cn) provides 20TB of storage capacity in total; a user can apply for an 8-core processor, 20GB memory, and a 100GB disk [34].

Table 2. The experimental environments.
SECTION V

EXPERIMENTAL RESULTS

A. Experimental Case Study

According to ClubCF, the experimental process proceeds in the two stages we specified in Section III: the clustering stage and the collaborative filtering stage.

In the first stage, the characteristic similarities between mashup services are first computed. Then, all mashup services are merged into $K$ clusters using Algorithm 1. In the second stage, the rating similarities between mashup services that belong to the same cluster are computed. As PCC may overestimate the rating similarities, enhanced rating similarities are calculated. Then, mashup services whose enhanced rating similarities with the target mashup service exceed a threshold are selected as its neighbors. Finally, the predicted rating of the target mashup service is computed.

1) Deployment of Clustering Stage

Step 1.1: Stem Words

Generally, a mashup service $s_{i}$ is described with some tags and functions through some APIs [35]. As an experimental case, seven concrete mashup services (i.e., $s_{1},s_{2},s_{3},s_{4},s_{5},s_{6}$, and $s_{7}$) and their corresponding tags and APIs are listed in Table III. The APIs of $s_{i}$ are put into $F_{i}$, and the tags of $s_{i}$ are put into $D_{i}$. The tags in $D_{i}$ are stemmed using the Porter stemmer and put into $D_{i}'$.

Table 3. Case of mashup services.

Step 1.2: Compute Description Similarity and Functionality Similarity

Description similarities between mashup services are computed using formula (1). For instance, there is one shared stemmed tag (i.e., "book") among the six different stemmed tags in $D_{2}'$ and $D_{5}'$; therefore, $D\_sim(s_{2},s_{5})=\frac{|D_{2}'\cap D_{5}'|}{|D_{2}'\cup D_{5}'|}=\frac{1}{6}$.

Functionality similarities between mashup services are computed using formula (2). Since there is only one API (i.e., "Amazon Product Advertising") in both $F_{2}$ and $F_{5}$, $F\_sim(s_{2},s_{5})=\frac{|F_{2}\cap F_{5}|}{|F_{2}\cup F_{5}|}=1$.

Step 1.3: Compute Characteristic Similarity

Characteristic similarity is the weighted sum of the description similarity and the functionality similarity, computed using formula (3). Without loss of generality, the weight of description similarity Formula$\alpha$ is set to 0.5. Then the characteristic similarity between Formula$\mathrm {s}_{2}$ and Formula$\mathrm {s}_{5}$ is computed as Formula$C\_{}sim( s_{2},s_{5} )=\alpha$ × Formula$D\_{}sim( s_{2},s_{5} )+( 1-\alpha )\times F\_{}sim( s_{2},s_{5} )=0.5\times \frac {1}{6}+0.5\times 1\cong 0.583$. It should be noted that all computation results hereafter are rounded to three decimal places.

Characteristic similarities between the seven mashup services are all computed in the same way, and the results are shown in Table IV.
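The weighted combination of formula (3) is a one-liner; reproducing the worked value:

```python
def characteristic_similarity(d_sim, f_sim, alpha=0.5):
    """Weighted sum of description and functionality similarity (formula (3))."""
    return alpha * d_sim + (1 - alpha) * f_sim

# C_sim(s2, s5) with alpha = 0.5, using D_sim = 1/6 and F_sim = 1 from above.
print(round(characteristic_similarity(1 / 6, 1.0), 3))  # 0.583
```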

Table 4
Table 4. Characteristic similarity matrix (keeping three decimal places).

Step 1.4: Cluster Services

In this step, Algorithm 1 is executed in the specified order. Initially, the seven services Formula$s_{1}{\sim }s_{7}$ are put into seven clusters Formula$C_{1}{\sim }C_{7}$ one by one, and the characteristic similarities between each pair of services in Table IV are assigned as the similarities of the corresponding clusters. The highlighted entry in Table V is the maximum similarity in the similarity matrix.

Table 5
Table 5. Initial similarity matrix (Formula$k=7$).

The reduction step of Algorithm 1 is described as follows.

  1. Step 1. Search for the pair of clusters with the maximum similarity in the similarity matrix and merge them.
  2. Step 2. Create a new similarity matrix in which the similarity between the merged cluster and any other cluster is the average of the similarities of its two parts.
  3. Step 3. Save the similarities and cluster partitions for later visualization.
  4. Step 4. Repeat from Step 1 until the matrix is of size Formula$K$, i.e., only Formula$K$ clusters remain.

Setting Formula$K=3$ as the termination condition of Algorithm 1, the reduction steps are illustrated in Tables VI–IX.

Table 6
Table 6. Algorithm 1: reduction step 1 (Formula$k=6$).
Table 7
Table 7. Algorithm 1: reduction step 2 (Formula$k=5$).
Table 8
Table 8. Algorithm 1: reduction step 3 (Formula$k=4$).
Table 9
Table 9. Algorithm 1: reduction step 4 (Formula$k=3$).

In reduction step 1, shown in Table VI, the maximum similarity in the similarity matrix is Formula$d_{C_{2},C_{5}}$, so Formula$C_{2}$ and Formula$C_{5}$ are merged into Formula$(C_{2},C_{5} )$. The similarity between Formula$(C_{2},C_{5})$ and each other cluster is the average of its parts' similarities. For example, Formula$d_{(C_{2},C_{5} ),C_{3}}=( d_{C_{2},C_{3}}+d_{C_{5},C_{3}} )/2=( 0+0.063 )/2\cong 0.032$.

In reduction step 2, shown in Table VII, the maximum similarity in the similarity matrix is Formula$d_{C_{3},C_{4}}$, so Formula$C_{3}$ and Formula$C_{4}$ are merged into Formula$( C_{3},C_{4} )$. The similarity between Formula$( C_{3},C_{4} )$ and each other cluster is the average of its parts' similarities. For example, Formula$d_{(C_{2},C_{5} ),(C_{3},C_{4} )}=(d_{(C_{2},C_{5} ),C_{3}}+d_{(C_{2},C_{5} ),C_{4}} )/2=(0.032+0.042) / 2=0.037$.

In reduction step 3, shown in Table VIII, the maximum similarity in the similarity matrix is Formula$d_{C_{1},( C_{3},C_{4} )}$, so Formula$C_{1}$ and Formula$( C_{3},C_{4} )$ are merged into Formula$( C_{1}, C_{3},C_{4} )$. The similarity between Formula$(C_{1}, C_{3},C_{4} )$ and each other cluster is the average of its parts' similarities. For example, Formula$d_{(C_{1}, C_{3},C_{4} ),C_{6}}=(d_{C_{1},C_{6}}+d_{(C_{3},C_{4} ),C_{6}}) / 2=(0+0.042 )/ 2=0.021.$

In reduction step 4, shown in Table IX, the maximum similarity in the similarity matrix is Formula$d_{(C_{1},C_{3},C_{4} ),C_{7}}$, so Formula$(C_{1},C_{3},C_{4} )$ and Formula$C_{7}$ are merged into Formula$(C_{1}, C_{3},C_{4},C_{7} )$. The similarity between Formula$( C_{1}, C_{3},C_{4},C_{7} )$ and each other cluster is the average of its parts' similarities. For example, Formula$d_{( C_{1}, C_{3},C_{4},C_{7} ),( C_{2},C_{5} )}=( d_{( C_{1}, C_{3},C_{4} ),( C_{2},C_{5} )}+d_{C_{7},(C_{2},C_{5} )} )/2=(0.019+0.050) /2\cong 0.035$.

Now only three clusters remain, so the algorithm terminates.

By using Algorithm 1, the seven mashup services are merged into three clusters: Formula$s_{2}$ and Formula$s_{5}$ form a cluster named Formula$C_{1}$; Formula$s_{1}$, Formula$s_{3}$, Formula$s_{4}$ and Formula$s_{7}$ form a cluster named Formula$C_{2}$; and Formula$s_{6}$ forms a cluster of its own, named Formula$C_{3}$.
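The reduction procedure above is an agglomerative clustering in which a merged cluster's similarity to any other cluster is the plain (unweighted) average of its two parts' similarities. A minimal sketch, assuming a complete pairwise similarity matrix as input; the labels and values in the usage example are illustrative, not the entries of Table IV:

```python
def cluster_to_k(sim, K):
    """Merge singleton clusters until only K remain (Algorithm 1's reduction loop).

    sim: dict {(i, j): similarity} covering every pair of item labels.
    Returns a set of frozensets, one per final cluster.
    """
    clusters = {frozenset([i]) for pair in sim for i in pair}
    # Pair keys are frozensets of two clusters, so argument order is irrelevant.
    d = {frozenset({frozenset([i]), frozenset([j])}): v for (i, j), v in sim.items()}
    while len(clusters) > K:
        pair = max(d, key=d.get)            # step 1: most similar pair of clusters
        a, b = tuple(pair)
        merged = a | b
        clusters -= {a, b}
        for c in clusters:                  # step 2: average the two old similarities
            d[frozenset({merged, c})] = (d.pop(frozenset({a, c})) +
                                         d.pop(frozenset({b, c}))) / 2
        del d[pair]
        clusters.add(merged)                # step 4: repeat until K clusters remain
    return clusters

# Illustrative 4-item matrix: items 1,2 and 3,4 should pair up for K = 2.
sim = {(1, 2): 0.9, (1, 3): 0.1, (1, 4): 0.1, (2, 3): 0.1, (2, 4): 0.3, (3, 4): 0.8}
print(cluster_to_k(sim, 2))  # two clusters: {1, 2} and {3, 4}
```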

2) Deployment of Collaborative Filtering Stage

Step 2.1: Compute Rating Similarity

Suppose there are four users (i.e., Formula$u_{1}$, Formula$u_{2}$, Formula$u_{3}$ and Formula$u_{4}$) who rated the seven mashup services. A rating matrix is established as Table X. The ratings are on 5-point scales, and 0 means the user did not rate the mashup. As Formula$u_{3}$ did not rate Formula$s_{4}$ (a not-yet-experienced item), Formula$u_{3}$ is regarded as an active user and Formula$s_{4}$ is taken as a target mashup. By computing the predicted rating of Formula$s_{4}$, it can be determined whether Formula$s_{4}$ is a recommendable service for Formula$u_{3}$. Furthermore, Formula$s_{1}$ is chosen as another target mashup. By comparing the predicted and real ratings of Formula$s_{1}$, the accuracy of ClubCF can be verified in this case.

Table 10
Table 10. Rating matrix.

Since Formula$s_{4}$ and Formula$s_{1}$ both belong to the cluster Formula$C_{2}$, rating similarities and enhanced rating similarities are computed between mashup services within Formula$C_{2}$ using formulas (4) and (5). The rating similarities and enhanced rating similarities between Formula$s_{4}$ and every other mashup service in Formula$C_{2}$ are listed in Table XI, while those between Formula$s_{1}$ and every other mashup service in Formula$C_{2}$ are listed in Table XII.
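A sketch of both similarity measures follows. Formula (4) is the Pearson correlation coefficient over co-rating users; formula (5) is not reproduced in this excerpt, so the enhancement shown here — damping PCC by the Jaccard overlap of the two services' rater sets, so that similarities backed by few co-raters shrink toward zero — is an assumed significance-weighting variant, not necessarily the paper's exact definition:

```python
from math import sqrt

def pcc(r_i, r_j):
    """Pearson correlation over users who rated both services (formula (4)).
    r_i, r_j: dicts {user: rating}; absent keys mean unrated."""
    common = [u for u in r_i if u in r_j]
    if len(common) < 2:
        return 0.0
    mi = sum(r_i[u] for u in common) / len(common)
    mj = sum(r_j[u] for u in common) / len(common)
    num = sum((r_i[u] - mi) * (r_j[u] - mj) for u in common)
    den = sqrt(sum((r_i[u] - mi) ** 2 for u in common) *
               sum((r_j[u] - mj) ** 2 for u in common))
    return num / den if den else 0.0

def enhanced_sim(r_i, r_j):
    """Assumed enhancement: multiply PCC by the Jaccard overlap of rater sets."""
    ui, uj = set(r_i), set(r_j)
    return len(ui & uj) / len(ui | uj) * pcc(r_i, r_j)

# Perfectly correlated on the three co-raters, but only 3 of 4 raters overlap,
# so the enhanced similarity is damped to 3/4 of the PCC value.
r_i = {"u1": 1, "u2": 2, "u3": 3}
r_j = {"u1": 2, "u2": 4, "u3": 6, "u4": 5}
print(pcc(r_i, r_j))           # 1.0
print(enhanced_sim(r_i, r_j))  # 0.75
```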

Table 11
Table 11. Rating similarities and enhanced rating similarities with Formula$\mathrm {s}_{4}$.
Table 12
Table 12. Rating similarities and enhanced rating similarities with Formula$\mathrm {s}_{1}$.

Step 2.2: Select Neighbors

Rating similarity is computed using the Pearson correlation coefficient, which ranges from −1 to +1: −1 indicates perfect negative correlation, while +1 indicates perfect positive correlation. Without loss of generality, the rating similarity threshold Formula$\gamma$ in formula (6) is set to 0.4.

Since the enhanced rating similarity between Formula$s_{4}$ and Formula$s_{1}$ is 0.467 (i.e., Formula${R\_{}sim}^{'}( s_{4}, s_{1} )=0.467)$ and the enhanced rating similarity between Formula$s_{4}$ and Formula$s_{3}$ is 0.631 (i.e., Formula${R\_{}sim}^{'}(s_{4}, s_{3} )=0.631)$, which are both greater than Formula$\gamma$, Formula$s_{1}$ and Formula$s_{3}$ are chosen as the neighbors of Formula$s_{4}$, i.e., Formula$N( s_{4} )=\{ s_{1}\mathrm {,}s_{3} \}$.

Since the enhanced rating similarity between Formula$s_{1}$ and Formula$s_{3}$ is 0.839 (i.e., Formula${R\_{}sim}^{'}( s_{1}, s_{3} )=0.839)$ and the enhanced rating similarity between Formula$s_{1}$ and Formula$s_{4}$ is 0.467 (i.e., Formula${R\_{}sim}^{'}( s_{1}, s_{4} )=0.467)$, which are both greater than Formula$\gamma$, Formula$s_{3}$ and Formula$s_{4}$ are chosen as the neighbors of Formula$s_{1}$, i.e., Formula$N( s_{1} )=\{ s_{3}\mathrm {,}s_{4} \}$.

Step 2.3: Compute Predicted Rating

According to formula (7), the predicted rating of Formula$s_{4}$ for Formula$u_{3}$ is Formula$P_{u_{3},s_{4}}=1.97$, and the predicted rating of Formula$s_{1}$ for Formula$u_{3}$ is Formula$P_{u_{3},s_{1}}=1.06$.

Thus, Formula$s_{4}$ is not a good mashup service for Formula$u_{3}$ and will not be recommended to Formula$u_{3}$. In addition, as the real rating of Formula$s_{1}$ given by user Formula$u_{3}$ is 1 (see Table X) while its predicted rating is 1.06, it can be inferred that ClubCF yields an accurate prediction in this case.
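Formula (7) is not reproduced in this excerpt; the sketch below assumes the common Resnick-style adjusted weighted sum for item-based CF (the target item's mean rating plus similarity-weighted deviations of the active user's ratings on the neighbor items). The labels reuse the case study's names but the ratings and means are illustrative, not the entries of Table X:

```python
def predict(user_ratings, target, neighbors, sim, item_means):
    """Adjusted weighted-sum prediction (an assumed form of formula (7))."""
    num = sum(sim[(target, j)] * (user_ratings[j] - item_means[j]) for j in neighbors)
    den = sum(abs(sim[(target, j)]) for j in neighbors)
    # With no usable neighbors, fall back to the target item's mean rating.
    return item_means[target] + (num / den if den else 0.0)

# Illustrative values; the similarities reuse the enhanced values from the text.
means = {"s4": 3.0, "s1": 2.0, "s3": 4.0}
u3 = {"s1": 3.0, "s3": 3.0}
sims = {("s4", "s1"): 0.467, ("s4", "s3"): 0.631}
print(predict(u3, "s4", ["s1", "s3"], sims, means))
```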

B. Experimental Evaluation

To evaluate the accuracy of ClubCF, Mean Absolute Error (MAE), which measures the deviation of predictions from the true user-specified ratings, is used in this paper. As Herlocker et al. [36] proposed, MAE is computed as follows:FormulaTeX Source$$\begin{equation} \mathrm {MAE}=\frac {\sum \nolimits _{i=1}^{n} \big | r_{a,t}-P(u_{a},s_{t} )\big |}{n} \end{equation}$$ In this formula, Formula$n$ is the number of rating-prediction pairs, Formula$r_{a,t}$ is the rating that an active user Formula$u_{a}$ gives to a mashup service Formula$s_{t}$, and Formula$P(u_{a}, s_{t})$ denotes the predicted rating of Formula$s_{t}$ for Formula$u_{a}$; the sum runs over the Formula$n$ pairs.
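In code, the metric is a one-liner; the single checkable pair from the case study (real rating 1, predicted 1.06) gives an error of 0.06:

```python
def mae(pairs):
    """Mean absolute error over (actual, predicted) rating pairs."""
    return sum(abs(r - p) for r, p in pairs) / len(pairs)

# u3's real rating of s1 is 1; ClubCF predicted 1.06.
print(round(mae([(1, 1.06)]), 2))  # 0.06
```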

In fact, ClubCF is a revised version of the traditional item-based CF approach, adapted to the big data environment. Therefore, to verify its accuracy, we compare the MAE of ClubCF with that of a traditional item-based CF approach (IbCF) described in [26]. For each test mashup service in each fold, the predicted rating is calculated with IbCF and ClubCF separately.

The mashup services published on ProgrammableWeb fall into six categories labeled with the keywords “photo,” “google,” “flash,” “mapping,” “enterprise,” and “sms.” Therefore, without loss of generality, the value of Formula$K$, the third input parameter of Algorithm 1, is set to 3, 4, 5, and 6 in our experiment. Furthermore, the rating similarity threshold Formula$\gamma$ is set to 0.1, 0.2, 0.3 and 0.4. Under these parameter settings, the predicted ratings of test services are calculated by ClubCF and IbCF, and the average MAEs of ClubCF and IbCF are computed using formula (8). The comparison results are shown in Fig. 2(a)–(d). Several observations follow.

  • When the rating similarity threshold Formula$\gamma <0.4$, the MAE of ClubCF decreases as the value of Formula$K$ increases. Since services are divided into more clusters, the services in a cluster are more similar to each other. Furthermore, neighbors of a target service are chosen from the cluster that the target service belongs to, so these neighbors tend to be more similar to the target service, which results in more accurate prediction. In contrast, Formula$K$ plays no role in IbCF because IbCF performs no clustering.
  • When Formula$\gamma <0.4$, the MAE of both ClubCF and IbCF decreases as the value of Formula$\gamma$ increases. This is because the neighbors become more similar to the target service as Formula$\gamma$ increases, so the predicted ratings of target services, computed from the history ratings of these neighbors, are closer to the actual values.
  • When Formula$\gamma <0.4$, the MAE of ClubCF is lower than that of IbCF. In ClubCF, services are first clustered according to their characteristic similarities, and rating similarities are then measured between services in the same cluster. Since the ratings of characteristic-similar services are more relevant to each other, computing rating similarities within a cluster is more accurate, and the neighbors chosen based on these rating similarities are more similar to the target services. Consequently, the predicted ratings of the target services are more precise than those of IbCF.
  • When Formula$\gamma =0.4$, the MAE of both ClubCF and IbCF increases. When Formula$k=5$ and Formula$k=6$, the MAE of ClubCF even exceeds that of IbCF. By checking the intermediate results of these two approaches, we found that many test services have few or even no neighbors when the rating similarity threshold is set to 0.4, especially when neighbors have to be selected from a smaller cluster. This results in comparatively large deviations between the predicted ratings and the real ratings. With sparser user-rating data, it is more difficult to find highly similar services. Therefore, if sparsity is inevitable, the rating similarity threshold should be adjusted dynamically according to the accuracy requirement of the application or the sparsity of the dataset.
Figure 2
Figure 2. Comparison of MAE with IbCF and ClubCF. (a) Formula$\gamma =0.1$. (b) Formula$\gamma =0.2$. (c) Formula$\gamma =0.3$. (d) Formula$\gamma =0.4$.

In addition, to evaluate the efficiency of ClubCF, the online computation time of ClubCF is compared with that of IbCF, as shown in Fig. 3(a)–(d). Several observations follow.

  • Overall, ClubCF spends less computation time than IbCF. Since the number of services in a cluster is far smaller than the total number of services, the time for computing rating similarities between every pair of services is greatly reduced.
  • As the rating similarity threshold Formula$\gamma$ increases, the computation time of ClubCF decreases, because fewer neighbors of the target service pass the threshold. For IbCF, a visible decrease appears only when Formula$\gamma =0.4$. Since the number of neighbors found within a cluster may be smaller than that found among all services, ClubCF may spend less time computing predicted ratings.
  • When Formula$\gamma =0.4$, the computation time of ClubCF decreases noticeably as Formula$K$ increases. A bigger Formula$K$ means fewer services in each cluster, and a bigger Formula${\gamma }$ yields fewer neighbors, so computing predicted ratings over fewer neighbors takes less time.
Figure 3
Figure 3. Comparison of Computation Time with ClubCF and IbCF. (a) Formula$\gamma =0.1$. (b) Formula$\gamma =0.2$. (c) Formula$\gamma =0.3$. (d) Formula$\gamma =0.4$.

According to the computational complexity analysis in Section III-C and the experimental results above, we can conclude that ClubCF gains good scalability by increasing the parameter Formula$K$ appropriately. With a suitable adjustment of Formula${\gamma }$, recommendation precision is also improved.

SECTION VI

RELATED WORK

Clustering methods for CF have been extensively studied. Mai et al. [37] designed a neural network-based clustering collaborative filtering algorithm for e-commerce recommendation systems. Their cluster analysis gathers users with similar characteristics according to web visiting message data. However, it is hard to say that a user's web visiting preference is relevant to his or her purchasing preference. Mittal et al. [38] proposed to achieve predictions for a user by first minimizing the size of the item set the user needs to explore. A Formula$K$-means clustering algorithm was applied to partition movies based on the genre requested by the user. However, this requires users to provide extra information. Li et al. [39] proposed to incorporate multidimensional clustering into a collaborative filtering recommendation model. In the first stage, background data in the form of user and item profiles is collected and clustered using the proposed algorithm. Then, poor clusters with similar features are deleted while appropriate clusters are further selected based on cluster pruning. In the third stage, an item prediction is made by performing a weighted average of deviations from the neighbor's mean. Such an approach is likely to increase the diversity of recommendations while maintaining their accuracy. Zhou et al. [40] represented Data-Providing (DP) services as vectors by considering the composite relation between inputs, outputs, and the semantic relations between them. The vectors were clustered using a refined fuzzy C-means algorithm. By merging similar services into the same cluster, the capability of a service search engine was improved significantly, especially in large Internet-based service repositories. However, this approach assumes that a domain ontology exists to facilitate semantic interoperability. Besides, it is not suitable for services that lack parameters. Pham et al. [41] proposed to use a network clustering technique on the social network of users to identify their neighborhoods, and then apply traditional CF algorithms to generate recommendations. This work depends on social relationships between users. Simon et al. [42] used a high-dimensional, parameter-free, divisive hierarchical clustering algorithm that requires only implicit feedback on past user purchases to discover relationships among users. Based on the clustering results, products of high interest were recommended to the users. However, implicit feedback does not always provide reliable information about a user's preference.

In the ClubCF approach, the description and functionality information is used as metadata to measure the characteristic similarities between services. According to such similarities, all services are merged into smaller clusters, and the CF algorithm is then applied to the services within the same cluster. Compared with the above approaches, ClubCF requires no extra input from users and suits different types of services. Moreover, the clustering algorithm used in ClubCF does not need to consider dependencies between nodes.

SECTION VII

CONCLUSION AND FUTURE WORK

In this paper, we present a ClubCF approach for big data applications relevant to service recommendation. Before the CF technique is applied, services are merged into clusters via an AHC algorithm, and the rating similarities between services within the same cluster are then computed. As the number of services in a cluster is much smaller than the total number in the system, ClubCF costs less online computation time. Moreover, as the ratings of services in the same cluster are more relevant to each other than to those in other clusters, prediction based on the ratings of services in the same cluster is more accurate than prediction based on the ratings of all similar or dissimilar services across all clusters. These two advantages of ClubCF have been verified by experiments on a real-world data set.

Future research can be done in two areas. First, with respect to service similarity, semantic analysis may be performed on the description text of services. In this way, more semantically similar services may be clustered together, which will increase the coverage of recommendations. Second, with respect to users, mining their implicit interests from usage records or reviews may complement their explicit interests (ratings). By this means, recommendations can be generated even when there are only a few ratings, which will alleviate the sparsity problem to some extent.

Footnotes

This work was supported in part by the National Science Foundation of China under Grants 91318301 and 61321491, the National Key Technology Research and Development Program of the Ministry of Science and Technology under Grant 2011BAK21B06, and the Excellent Youth Found of Hunan Scientific Committee of China under Grant 11JJ1011.

Corresponding Author: R. Hu


Authors

Rong Hu


Rong Hu was born in Xiangtan, Hunan, China, in 1977. She is currently pursuing the Ph.D. degree in computer science and technology from Nanjing University, China. She received the M.A. degree from the College of Information Engineering, Xiangtan University, in 2005. Her current research interests include service computing and big data.

Wanchun Dou


Wanchun Dou was born in 1971. He is a Professor with the Department of Computer Science and Technology, Nanjing University, China. He received the Ph.D. degree from the Nanjing University of Science and Technology, China, in 2001. Then, he continued his research work as a Post-Doctoral Researcher with the Department of Computer Science and Technology, Nanjing University, from 2001 to 2002. In 2005, he visited the Hong Kong University of Science and Technology as a Visiting Scholar. His main research interests include knowledge management, cooperative computing, and workflow technologies.

Jianxun Liu


Jianxun Liu was born in 1970. He received the M.S. and Ph.D. degrees in computer science from the Central South University of Technology and the Shanghai Jiao Tong University in 1997 and 2003, respectively. He is currently a Professor with the Department of Computer Science and Engineering, Hunan University of Science and Technology. His current interests include workflow management systems, services computing, and cloud computing.
