Heterogeneous attention concentration link prediction algorithm for attracting customer flow in online brand community

Attracting users from a large, mature online product community to a new, small one through friend recommendation is vital for new product marketing in social networks. However, traditional link prediction algorithms for friend recommendation cannot achieve high accuracy when attracting customer flow between large and small circles, because of the network sparsity and scale-free problems. To better handle link prediction for node pairs between circles of different sizes, we propose a collaborative combined link prediction algorithm (CCLPA), which can deeply extract user attention concentration (AC) features in sparse networks. CCLPA has three distinctive merits. First, different edges in the network are assigned different attention, and heterogeneous attention concentration indexes (HACIs) within and beyond the triadic closure structure are defined accordingly. Second, a random forest (RF) model is designed to adaptively select the appropriate HACIs for a given circle structure, so as to avoid the impact of the scale-free problem on link prediction accuracy between different circles. Third, according to the collaboration of the selected indexes and their sensitivity to the circle structure, an appropriate sensitive collaborative heterogeneous attention concentration index (SCHACI) is built to avoid the negative impact of blindly combining indexes on prediction performance. Experimental results on Twitter confirm the effectiveness of the proposed method in attracting customer flow in online brand communities.

influential users in the community with mature large products.
Through the influence of friendships, some users in the large community are attracted to the new, small community; this is what we call attracting customer flow. By transferring customer flow, we can encourage users in the mature product circle to buy new products and thereby promote the new products.
When user nodes in different circles establish links, they face the problem of network sparsity, which is characterized by an average degree far smaller than the number of nodes [7]. Besides, predicting links between users in large and small circles faces the scale-free problem [8]: the degrees of nodes in large circles are larger, while the degrees of nodes in small circles are smaller. The major method for friend recommendation in social networks is the scoring link prediction algorithm (SLPA) [9]. However, most of these algorithms are fixed, rigid and designed for large networks; they consider neither the different attention that users assign to each edge nor the self-adaptive construction of a suitable SLPA for a specific circle structure. As a result, these SLPAs cannot deeply extract the possible friendships in sparse networks, and their performance is unsatisfactory when the degrees of node pairs fluctuate greatly. Therefore, to fill this gap, we consider the heterogeneity of the attention concentration (AC) allocated by nodes to each edge and propose a collaborative combined link prediction algorithm (CCLPA) from the perspective of the heterogeneous AC of users in the triadic closure structure. First, to fully describe the possibility of friendships between node pairs in sparse networks, and considering that different edges in the network are assigned different attention, heterogeneous attention concentration indexes (HACIs) within and beyond the triadic closure structure are constructed. Then, to overcome the influence of the scale-free problem on prediction accuracy, a random forest (RF) model [10] is designed to select suitable HACIs for a specific network according to the shortest distance between nodes and the AC of nodes. Next, we build a suitable composite HACI, which can deeply extract user AC features in the circle structure, in three steps.
First, three kinds of suitable HACIs are selected for each network, namely the most, second and third suitable HACIs.
Second, a logistic regression (LR) model is developed to identify the appropriate collaborative HACIs for each suitable HACI, and three collaborative suitable composite HACIs are then built, which avoids the negative impact of blindly combining indicators on prediction performance. Finally, the three collaborative suitable composite HACIs are merged into a sensitive collaborative heterogeneous attention concentration index (SCHACI) according to their sensitivity to the circle structure, so as to deeply extract user AC features in sparse networks and accurately predict the possible links between small and large circles. According to the predicted links, users in small and large circles can be recommended to become friends, and customer flow is thereby attracted accurately.
The remainder of this study is organized as follows: Section II reviews link prediction; Section III describes friend recommendation; Section IV explains the CCLPA; Section V presents the experimental design and the analysis of the results; Section VI concludes the study.

II. LINK PREDICTION
Link prediction is a fundamental problem in social network analysis [11]. The existing link prediction techniques can be roughly divided into four categories: similarity-based methods, probabilistic and maximum likelihood-based methods, dimensionality reduction-based methods and algorithm-based methods [12][13][14].
Among these methods, the similarity-based method is the most widely studied, which can be applied to large-scale networks. However, the traditional link prediction methods based on local similarity are mostly based on large networks, and their performance in sparse networks is not satisfactory.
Some scholars have proposed algorithms for sparse networks.
Shang et al. [15] constructed heterogeneity index (HEI), homogeneity index (HOI) and heterogeneity adaptation index (HAI), and proposed a link prediction algorithm to solve the network sparsity and scale-free problems faced in link prediction, but the algorithm could only get good performance in regular tree networks with high heterogeneity. Zhang et al. [16] proposed a link prediction framework, AdaSim, by introducing an Adaptive Similarity function using features obtained from network embedding based on random walks.
Experimental results showed that AdaSim was robust to different sparsities of the networks. Nguyen and Mamitsuka [17] transformed the link prediction problem into a binary classification problem and used a kernel function to represent the latent network characteristics, so that the method could be extended to large-scale networks. They also showed that this method could be applied well to sparse networks. However, these algorithms did not sufficiently consider the heterogeneity of the network, so they could not fully mine the information contained in the network and could not ensure accurate prediction in some special cases (such as the link prediction between circles of different scales studied here).
In recent years, a few scholars have developed link prediction algorithms based on heterogeneous networks by considering the weights of edges and combining them with a variety of network features. For example, Ozcan et al. [18] proposed a new method called multivariable time series link prediction for evolving heterogeneous networks by combining the node connection information, local similarity indicators and global similarity indicators of time, link and multitype relationships. Bütün et al. [19] proposed a new link prediction method combining with directed, weighted and time information of links based on the neighborhood-based link prediction method. Kuo et al. [20] devised a novel unsupervised framework to predict the opinion holder in a heterogeneous social network without any labeled data. Liu et al. [21] used three zero models to describe the topological structure and link weights of the network, and generated a general link prediction method by combining them.
Aghabozorgi et al. [22] measured the similarity of nodes based on the recent activities of nodes and the weights of edges, and proposed a supervised link prediction method that took network features and node similarities as its feature sets. Lü and Zhou [23] used local similarity indicators to estimate the likelihood of links in weighted networks, including the common neighbors (CN), Adamic-Adar (AA) and resource allocation (RA) indicators. However, the indicators they selected could not show that prediction with weighted links performed better than prediction with unweighted links. Shang et al. [24] found that shifting attention from the direct link weight between nodes to the link weight between nodes and their common neighbors could improve the performance of the algorithm.
Shang et al. [25] proved that the weight value of network structure and the number of common neighbors played an important role in link prediction. Similarly, the link prediction algorithm proposed in this study not only considers the weight of the link, but also fully considers the heterogeneous AC assigned by nodes to different edges when constructing the HACIs, which helps to fully mine the information contained in the network.
In network link prediction, we can get better algorithm performance by shifting attention from the direct links between nodes to the common neighbors of node pairs. For example, existing studies have proved that algorithms with more common neighbor effects could achieve better performance in Facebook network, Contact network and E-mail network. This showed that if two users had more common friends, they were more likely to establish friend relationships in the future [24].
Guimera & Sales-Pardo [26] reconstructed the network by observing its missing and false links based on the impact of nodes on their common neighbors. Vallès-Català et al. [27] constructed prediction indicators through the common neighbors of node pairs and indicated that overfitting might occur when pursuing the best link prediction results. Lü et al. [28] proposed a local path index (LP index) based on the common neighbors of node pairs to estimate the likelihood of links between nodes. A large number of simulation experiments on networks showed that the LP index had higher efficiency and effectiveness than two widely used indexes: CN and the Katz index. In addition, existing studies used many other indicators to predict the links between node pairs based on their common neighbors, such as average commute time (ACT), random walk with restart (RWR), matrix forest index (MFI), etc. [14].
Based on this, we fully consider node pairs and their common neighbors when constructing indicators. Moreover, in order to further make full use of the large range of network information and mine the possible links in extremely sparse networks, we also consider the role of the neighbors of the common neighbors, that is, the indirect common neighbors, when constructing the HACIs.
Many scholars have proposed friend recommendation algorithms in social networks based on link prediction. For example, Cheng et al. [29] proposed an extensible friend recommendation framework that combines seven information sources covering personal characteristics, network structure characteristics and social characteristics. Chen et al. [30] combined social impact with learning-to-rank technology to analyze user behavior, and proposed a learning-based recommendation method to recommend informative friends to users. Ma et al. [31] proposed local friend recommendation indexes and mixed friend recommendation indexes based on weak group structure. Yu et al. [32]    , users in П 1 can be recommended to users in П 2 as friends [13]. describes the network with predicted links. In Fig. 1(a), users 1, 2, 3 and 4 belong to П 1, and users 5, 6, 7, 8, 9, 10, 11, 12 and 13 belong to П 2. Assume that user 3 is a marketer; the purpose of friend recommendation is to make influential users in П 2 and their friends buy product П 1. Because node 8 has the largest degree, it is selected as the most influential node in П 2. Then, the link prediction algorithm predicts the possibility of a friendship between users 8 and 3 and, if a friendship is likely, user 3 is recommended as a friend of user 8, as shown in Fig. 1(b). User 3 encourages user 8 and his friends to buy product П 1 through the influence of friendship, which accurately attracts customer flow from the mature product circle to the new one [13]. However, the challenge of attracting customer flow from large to small circles is how to overcome the network sparsity and scale-free problems. Therefore, this study proposes CCLPA to solve these problems.

B. HACI FOR FRIEND RECOMMENDATION
Attracting users from a mature product community to a new one is an effective way of new product marketing. In order to overcome the problem of network sparsity when recommending friends among different scale circles, the hidden feature structure information in the sparse network is deeply extracted by constructing a variety of HACIs.
Specifically, suppose that the degree of node i is (node degree refers to the number of edges connected with the node), based on degree , the attention assigned by user i to its each link in the social network is 1 + , where is a constant.
Therefore, based on the above methods, different attention is assigned to each edge and HACIs are proposed.
The triadic closure structure was introduced into complex network research by Newman [36] and has been widely adopted by many scholars [14][15][37][38]. It refers to the social property of a triple composed of three nodes X, Y and Z, as shown in Fig. 2: if node pairs (X, Y) and (X, Z) are connected, then Y and Z are more likely to establish a friendship. In order to fully describe the possibility of establishing friendships between node pairs in sparse networks, we construct HACIs within and beyond the triadic closure structure. HACIs within the triadic closure structure only contain the information of the node pair's direct common neighbors, whereas HACIs beyond the triadic closure structure also involve the information of the node pair's indirect common neighbors.
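The distinction between direct and indirect common neighbors can be illustrated with a minimal Python sketch. It assumes an undirected graph stored as adjacency sets and takes the indirect common neighbors of a pair to be the neighbors of its direct common neighbors, excluding the pair itself and the direct common neighbors. The helper names (`build_graph`, `direct_common_neighbors`, `indirect_common_neighbors`) and the extra nodes W and V are hypothetical and used only for illustration.

```python
from typing import Dict, Hashable, Iterable, Set, Tuple

Graph = Dict[Hashable, Set[Hashable]]


def build_graph(edges: Iterable[Tuple[Hashable, Hashable]]) -> Graph:
    """Build an undirected graph as a dict of adjacency sets."""
    graph: Graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph


def direct_common_neighbors(graph: Graph, x: Hashable, y: Hashable) -> Set[Hashable]:
    """Nodes adjacent to both x and y (the third node of an open triad)."""
    return graph[x] & graph[y]


def indirect_common_neighbors(graph: Graph, x: Hashable, y: Hashable) -> Set[Hashable]:
    """Neighbors of the direct common neighbors (assumed definition),
    excluding x, y and the direct common neighbors themselves."""
    direct = direct_common_neighbors(graph, x, y)
    indirect: Set[Hashable] = set()
    for z in direct:
        indirect |= graph[z]
    return indirect - direct - {x, y}


# Open triad X-Y, X-Z as in Fig. 2, plus two illustrative nodes W and V.
graph = build_graph([("X", "Y"), ("X", "Z"), ("X", "W"), ("W", "V")])
print(direct_common_neighbors(graph, "Y", "Z"))    # {'X'}
print(indirect_common_neighbors(graph, "Y", "Z"))  # {'W'}
```

Here X is the only direct common neighbor of the candidate pair (Y, Z), and W, a friend of X, is the only indirect one; V is two steps away from X and is therefore not counted.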

HACI between node pairs and their direct common neighbors
PA, CN, Salton, Jaccard, Sørensen, HPI, HDI and LHN are redefined according to the heterogeneous AC, that is, by taking into account that different edges are assigned different attention. They are divided into two categories based on the network structure: HACIs between node pairs, and HACIs between node pairs and their direct common neighbors, as shown in Table 1.
In Table 1, the basic principle of the definition of HACIs is that the higher the AC of common neighbors is, the more possible the node pairs have friendships and the greater the scores of the node pairs are, vice versa. In Table 1, (•) represents the neighbor sets of nodes, and ( , ) indicates the AC assigned by node x to node y, namely 1 + .
represents the AC of node x, namely the sum of the attention assigned by neighbors of node x.
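For orientation, the classical, unweighted forms of the eight indexes named above can be sketched in Python as follows. This is only the well-known baseline in which every edge counts equally; it is not the attention-weighted redefinition of Table 1, which is not reproduced here. The function names and the toy graph are hypothetical.

```python
import math
from typing import Dict, Hashable, Iterable, Set, Tuple

Graph = Dict[Hashable, Set[Hashable]]


def build_graph(edges: Iterable[Tuple[Hashable, Hashable]]) -> Graph:
    """Build an undirected graph as a dict of adjacency sets."""
    graph: Graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph


def classical_scores(graph: Graph, x: Hashable, y: Hashable) -> Dict[str, float]:
    """Classical local similarity scores for the node pair (x, y)."""
    nbrs_x, nbrs_y = graph[x], graph[y]
    kx, ky = len(nbrs_x), len(nbrs_y)
    cn = len(nbrs_x & nbrs_y)      # number of common neighbors
    union = len(nbrs_x | nbrs_y)

    def ratio(num: float, den: float) -> float:
        # Return 0 when the denominator vanishes (isolated nodes).
        return num / den if den else 0.0

    return {
        "PA": float(kx * ky),                      # preferential attachment
        "CN": float(cn),                           # common neighbors
        "Salton": ratio(cn, math.sqrt(kx * ky)),
        "Jaccard": ratio(cn, union),
        "Sorensen": ratio(2 * cn, kx + ky),
        "HPI": ratio(cn, min(kx, ky)),             # hub promoted index
        "HDI": ratio(cn, max(kx, ky)),             # hub depressed index
        "LHN": ratio(cn, kx * ky),                 # Leicht-Holme-Newman
    }


# Toy graph: a and b share the neighbors c and d; a has one more friend e.
graph = build_graph([("a", "c"), ("b", "c"), ("a", "d"), ("b", "d"), ("a", "e")])
for name, value in classical_scores(graph, "a", "b").items():
    print(f"{name}: {value:.3f}")
```

In this baseline each common neighbor contributes exactly 1 to the score, whereas the HACIs weight each contribution by the attention assigned along the corresponding edge.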

HACI between direct common neighbors and neighbors of node pairs
According to the heterogeneous AC, we define the HACI between direct common neighbors and neighbors of node pairs; the formula is shown in Table 1.

2) HACI BEYOND TRIADIC CLOSURE STRUCTURE
Considering the sparsity of the local network caused by the connection between different scale circles, HACIs between node pairs and their indirect common neighbors are developed innovatively. Specifically, attention assigned to common neighbors by the friends of common neighbors will change the link of common neighbors, and then indirectly affect the attention that the common neighbors assign to the target node pairs.
(a) TA1
TA1 represents the attention that indirect common neighbors assign to node pairs. The fewer friends the indirect common neighbors have, the more attention they assign to their common neighbors, which indicates a higher AC score, as shown in formula (1).
(b) TA2
TA2 represents the attention assigned to node pairs by direct and indirect common neighbors. The smaller the clustering coefficients are, the higher the AC of the indirect common neighbors is. The fewer friends the common neighbors have, the more attention the common neighbors assign to node pairs, which indicates a higher AC, as shown in formula (2).
is the clustering coefficient of node z, represents the number of edges connected between the neighbors of node z. and are constants.
(c) TA3
Obviously, the more friends the nodes have, the more their attention is dispersed. In TA3, the attention dispersion of the nodes and the attention dispersion of the common neighbor nodes are combined, as shown in formula (3). Therefore, RTA is constructed, as shown in formula (4).
sensitivity to the network structure, and subsequently, friendships between users in small and large circles are predicted. Fig. 3 shows the structure of CCLPA.
The AUC value can be interpreted as the probability that a randomly selected missing link gets a higher score than a randomly selected nonexistent link. In this study, it is defined in formula (5), where n represents the number of independent comparisons and n′ denotes the number of times the linked node pairs have higher scores; a larger AUC indicates a higher accuracy of the algorithm. To test the accuracy of the algorithm on the network in Fig. 4(a), we need to select some existing links as probed links. For example, we select (1,3) and (4,5) as probed links, as shown by the dotted lines in Fig. 4(b). The algorithm can only be trained using the information contained in the solid lines in Fig. 4(b).
Assume that the algorithm assigns scores S12, S13, S14, S34 and S45 to all unobserved links. Then, we need to compare the scores of probed links and non-existent links [14], and accordingly, calculate AUC using formula (5).
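The comparison described for the Fig. 4 example can be sketched in Python as follows. The score values are invented purely for illustration, and two choices are assumptions rather than statements about formula (5): all probed/nonexistent pairs are enumerated exhaustively instead of being sampled, and a tie is counted as half a win, which is the common convention for the link prediction AUC.

```python
from typing import Dict, Iterable, Tuple

Link = Tuple[int, int]


def auc(scores: Dict[Link, float],
        probed: Iterable[Link],
        nonexistent: Iterable[Link]) -> float:
    """AUC over all (probed link, nonexistent link) score comparisons."""
    probed = list(probed)
    nonexistent = list(nonexistent)
    higher = 0  # comparisons won by the probed link
    ties = 0    # comparisons with equal scores (assumed to count 0.5)
    for p in probed:
        for q in nonexistent:
            if scores[p] > scores[q]:
                higher += 1
            elif scores[p] == scores[q]:
                ties += 1
    n = len(probed) * len(nonexistent)  # total number of comparisons
    return (higher + 0.5 * ties) / n


# Unobserved links of the Fig. 4 example with hypothetical scores.
scores = {(1, 2): 0.5, (1, 3): 0.6, (1, 4): 0.4, (3, 4): 0.1, (4, 5): 0.4}
probed_links = [(1, 3), (4, 5)]               # removed existing links
nonexistent_links = [(1, 2), (1, 4), (3, 4)]  # links absent from the network

print(auc(scores, probed_links, nonexistent_links))  # 0.75
```

With these scores, four of the six comparisons are won by a probed link and one is a tie, giving (4 + 0.5) / 6 = 0.75; a value of 0.5 would correspond to random scoring.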

B. SELECTING HACI BASED ON RF
The sparsity of network circles is diverse, which may lead to over-fitting of the model when identifying HACIs; that is, the model fits specific circles too closely and cannot adapt reliably to other circles [27]. Because RF has high prediction accuracy, tolerates outliers and noise well, and is not prone to over-fitting [40], it is selected to adaptively screen HACIs for specific circles.

Network characteristic indexes
In this study, in order to accurately describe the direct and potential friendships in the network, the following two
b) Network efficiency.
In the network, the smaller the social distance between nodes is, the smoother the communication between nodes will be, and the higher the network efficiency is. The calculation of network efficiency is shown in formula (7).
Where Q represents the total number of nodes in the network.
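As an illustration, the following Python sketch computes network efficiency under the standard definition, namely the average of the reciprocal shortest-path length over all ordered pairs of distinct nodes, with unreachable pairs contributing zero. Whether formula (7) matches this form exactly is an assumption; `q` plays the role of Q, and the function names are hypothetical.

```python
from collections import deque
from typing import Dict, Hashable, Set

Graph = Dict[Hashable, Set[Hashable]]


def shortest_path_lengths(graph: Graph, source: Hashable) -> Dict[Hashable, int]:
    """Breadth-first search distances from source in an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def network_efficiency(graph: Graph) -> float:
    """Mean of 1/d(i, j) over ordered pairs i != j (standard definition)."""
    q = len(graph)  # total number of nodes
    if q < 2:
        return 0.0
    total = 0.0
    for i in graph:
        dist = shortest_path_lengths(graph, i)
        # Nodes not reached by BFS are absent from dist and contribute 0.
        total += sum(1.0 / d for j, d in dist.items() if j != i)
    return total / (q * (q - 1))


path = {1: {2}, 2: {1, 3}, 3: {2}}            # path 1-2-3
triangle = {1: {2, 3}, 2: {1, 3}, 3: {1, 2}}  # complete graph on 3 nodes
print(network_efficiency(path))      # 5/6, about 0.833
print(network_efficiency(triangle))  # 1.0
```

The triangle reaches the maximum value of 1 because every pair is at distance 1, whereas the path loses efficiency through the single pair at distance 2.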
(2) Indicators relevant to the AC of nodes.
a) Average node intensity.
The average node intensity represents the average AC allocated to each node in the network, as shown in formula (8).
Where = ∑ ∈ ( ) indicates the AC of node i, represents the attention assigned by node j to the link between node i and j.
b) Degree heterogeneity.
Degree heterogeneity indicates the heterogeneity of the AC of nodes in the network, as shown in formula (9).
c) Assortativity coefficient.
The assortativity coefficient, shown in formula (10), represents the matching characteristics of the nodes' AC. If the assortativity coefficient is greater than 0, the network is called assortative, and nodes with similar AC tend to link with each other. If the assortativity coefficient is less than 0, the network is disassortative, and nodes with large AC differences tend to link with each other.
Where M is the total number of connected edges in the network, and are the AC of node i and j, respectively.
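The sign behavior of the assortativity coefficient can be checked with a short Python sketch. It uses Newman's edge-based formula applied to an arbitrary per-node value; assuming that formula (10) has this form is a guess based on the symbols described above. Because the attention formula is not reproduced here, the example uses plain node degree as a stand-in for the AC of a node, and all names are hypothetical.

```python
import math
from typing import Dict, Hashable, List, Tuple

Edge = Tuple[Hashable, Hashable]


def assortativity(edges: List[Edge], value: Dict[Hashable, float]) -> float:
    """Newman-style assortativity of a node value over undirected edges."""
    m = len(edges)  # total number of edges
    prod = sum(value[i] * value[j] for i, j in edges) / m
    mean_half = sum(0.5 * (value[i] + value[j]) for i, j in edges) / m
    sq_half = sum(0.5 * (value[i] ** 2 + value[j] ** 2) for i, j in edges) / m
    denom = sq_half - mean_half ** 2
    if denom == 0:
        return math.nan  # undefined when all edge endpoints share one value
    return (prod - mean_half ** 2) / denom


def degrees(edges: List[Edge]) -> Dict[Hashable, float]:
    """Node degree, used here only as a stand-in for the AC of a node."""
    deg: Dict[Hashable, float] = {}
    for i, j in edges:
        deg[i] = deg.get(i, 0.0) + 1.0
        deg[j] = deg.get(j, 0.0) + 1.0
    return deg


star = [("c", "l1"), ("c", "l2"), ("c", "l3")]  # hub linked to three leaves
chain = [("a", "b"), ("b", "c"), ("c", "d")]    # path of four nodes
print(assortativity(star, degrees(star)))    # -1.0, fully disassortative
print(assortativity(chain, degrees(chain)))  # -0.5
```

In the star every edge joins the high-value hub to a low-value leaf, so the coefficient takes its minimum of -1; a network in which similar nodes link to each other would give a positive value.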

RF model
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.
To sum up, the steps of generating the random forest are as follows.
Step 2: For the j-th sample subset, calculate the network characteristic indexes related to the shortest path, degree and AC of nodes. Randomly select = (√ ) indexes from the k network characteristic indexes as the candidate segmentation feature subset, calculate the partition Gini index of each candidate characteristic index, select the characteristic index with the smallest partition Gini index as the root node or higher-level node, and then branch using its best branch threshold.
Step 3: Use the same method in step 2 to recursively establish tree branches for data subsets corresponding to branches with different characteristics, until all sample data of each branch belong to the same type of HACI.
Step 4: Repeat step 2 and step 3 in parallel to generate all N decision trees.
Step 5: Extract decision rules. For each decision tree generated in step 4, decision rules can be mined directly, that is, the most, second and third suitable HACIs for the circle can be identified according to the characteristic indexes of the community network.
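The split selection in Step 2 can be sketched in Python as follows. The sketch assumes that each training sample is one network described by its characteristic indexes and labeled with its most suitable HACI, that about √k features are drawn by rounding (the exact rounding rule is an assumption), and that thresholds are taken at midpoints between consecutive observed values. The feature names, label names and data are hypothetical.

```python
import math
import random
from collections import Counter
from typing import Dict, List, Optional, Sequence, Tuple


def gini(labels: Sequence[str]) -> float:
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def partition_gini(values: Sequence[float], labels: Sequence[str],
                   threshold: float) -> float:
    """Size-weighted Gini impurity after splitting at the threshold."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


def sample_candidates(features: Sequence[str], rng: random.Random) -> List[str]:
    """Draw about sqrt(k) of the k features (rounding is an assumption)."""
    m = max(1, round(math.sqrt(len(features))))
    return rng.sample(list(features), m)


def best_split(samples: Sequence[Dict[str, float]], labels: Sequence[str],
               candidates: Sequence[str]) -> Optional[Tuple[float, str, float]]:
    """Return (partition Gini, feature, threshold) with the smallest Gini."""
    best: Optional[Tuple[float, str, float]] = None
    for feature in candidates:
        values = [s[feature] for s in samples]
        uniq = sorted(set(values))
        for lo, hi in zip(uniq, uniq[1:]):
            threshold = (lo + hi) / 2  # midpoint between observed values
            score = partition_gini(values, labels, threshold)
            if best is None or score < best[0]:
                best = (score, feature, threshold)
    return best


# Four hypothetical networks, each labeled with its most suitable HACI.
samples = [
    {"efficiency": 0.2, "assortativity": -0.3},
    {"efficiency": 0.3, "assortativity": 0.1},
    {"efficiency": 0.7, "assortativity": -0.2},
    {"efficiency": 0.8, "assortativity": 0.2},
]
labels = ["TA1", "TA1", "RA*", "RA*"]
print(best_split(samples, labels, ["efficiency", "assortativity"]))
```

On this toy data the efficiency feature separates the two labels perfectly, so it is chosen as the root split with a partition Gini index of 0. Step 3 would apply `best_split` recursively to each branch, drawing a fresh candidate subset with `sample_candidates` at every node.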

Factors for identifying collaborative HACI
The similarity distance between any two HACIs (i.e. C and D) is described from the following three dimensions.
(1) The difference between two HACIs, which can be described by the Hamming distance and Jaccard distance, as shown in formula (13) and (14), respectively.
Hamming distance of HACIs C and D is shown in formula (13).
(2) The similarity distance between HACIs, which can be represented by the square of Euclidean distance ( ) and Minkowski distance ( ) , as shown in formula (15) and (16), respectively. Where p is a constant.
(3) The degree of deviation between two HACIs, which can be described by the mean absolute difference ( ) and mean square error ( ), as shown in formula (17) and (18) (19).
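The six measures named in the three dimensions above can be sketched in Python using their textbook definitions. This assumes that each HACI is represented by a vector over the same list of node pairs: binary link predictions for the Hamming and Jaccard distances, and raw scores for the remaining four. Whether formulas (13) to (18) use exactly these representations is an assumption, and all function names and data are hypothetical.

```python
from typing import Sequence


def hamming(c: Sequence[int], d: Sequence[int]) -> int:
    """Number of node pairs on which two binary predictions disagree."""
    return sum(a != b for a, b in zip(c, d))


def jaccard_distance(c: Sequence[int], d: Sequence[int]) -> float:
    """1 minus the share of predicted links common to both indexes."""
    both = sum(1 for a, b in zip(c, d) if a and b)
    either = sum(1 for a, b in zip(c, d) if a or b)
    return 1.0 - both / either if either else 0.0


def squared_euclidean(c: Sequence[float], d: Sequence[float]) -> float:
    """Sum of squared score differences."""
    return sum((a - b) ** 2 for a, b in zip(c, d))


def minkowski(c: Sequence[float], d: Sequence[float], p: float) -> float:
    """Minkowski distance of order p (p is a constant, p >= 1)."""
    return sum(abs(a - b) ** p for a, b in zip(c, d)) ** (1.0 / p)


def mean_absolute_difference(c: Sequence[float], d: Sequence[float]) -> float:
    """Average absolute score difference."""
    return sum(abs(a - b) for a, b in zip(c, d)) / len(c)


def mean_squared_error(c: Sequence[float], d: Sequence[float]) -> float:
    """Average squared score difference."""
    return squared_euclidean(c, d) / len(c)


# Hypothetical outputs of two HACIs, C and D, on the same node pairs.
c_links, d_links = [1, 0, 1, 1], [1, 1, 0, 1]
c_scores, d_scores = [1.0, 2.0, 3.0], [1.0, 4.0, 1.0]

print(hamming(c_links, d_links))                     # 2
print(jaccard_distance(c_links, d_links))            # 0.5
print(squared_euclidean(c_scores, d_scores))         # 8.0
print(minkowski(c_scores, d_scores, p=1))            # 4.0
print(mean_absolute_difference(c_scores, d_scores))  # 4/3, about 1.333
print(mean_squared_error(c_scores, d_scores))        # 8/3, about 2.667
```

Small values on all six measures indicate that the two indexes largely duplicate each other, while larger values suggest that they capture different structural information.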

D. SCHACI
In order to improve the performance of the combined HACIs, the average AUC value for the HACI calculated in Where indicates the weight of , that is, its average AUC value. Similarly, represents the weight of .

A. EXPERIMENTAL DESIGN
The purpose of this study is to recommend users in small product circles to users in large mature product circles; if friendships are possible, the users in large circles become fans of users in small circles, which enables the sale of new products to users in mature product circles. This is very similar to the directed relationship between users on Twitter. Therefore, we use the directed Twitter data set obtained from the Stanford Network Analysis Project website (http://snap.stanford.edu/data/ego-Twitter.html) to verify CCLPA. It should be noted that the HACIs constructed in this study are based on undirected networks and can be used for both one-way and two-way link prediction between nodes.
Therefore, the proposed HACIs are fully applicable to Twitter.
We verified the validity of CCLPA on 971 data sets from Twitter. Each ego-network in the data set represents an online brand community, in which the central node denotes the brand enterprise and users are clustered into various product circles. We selected 300 networks with product circles from Twitter. In each experiment, 240 of the 300 networks were randomly selected as the training set and the other 60 networks were used as the test set. In order to test the effectiveness of CCLPA in attracting customer flow, the links between the nodes in the large and small circles were predicted in each network. Table 2 shows the mean, minimum and maximum of the statistical indicators of the selected network samples.
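The repeated random split described above can be sketched in Python as follows. The network identifiers and the function name are hypothetical, and the use of a fixed seed per experiment is an assumption made only so that the sketch is reproducible.

```python
import random
from typing import List, Sequence, Tuple


def split_networks(network_ids: Sequence[int], n_train: int,
                   rng: random.Random) -> Tuple[List[int], List[int]]:
    """Randomly split network identifiers into training and test sets."""
    shuffled = list(network_ids)
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


network_ids = list(range(300))  # hypothetical identifiers of the 300 networks
n_experiments = 100             # number of random experiments

for experiment in range(n_experiments):
    rng = random.Random(experiment)  # one seed per experiment (assumption)
    train_ids, test_ids = split_networks(network_ids, 240, rng)
    # Train on the 240 networks in train_ids, evaluate AUC on the 60 in test_ids.

print(len(train_ids), len(test_ids))  # 240 60
```

Averaging the test AUC over the repeated experiments reduces the dependence of the reported accuracy on any single random split.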
For brevity, the 25 algorithms are listed in Table 1 and Table 3. Table 1 shows HACIs without parameters, and Table 3 shows HACIs with specific parameter values. Table 1 were used to predict the links between node pairs. All algorithms were implemented in MATLAB with default settings.
Table 4 reports the average AUC of all algorithms proposed in this study over 100 random experiments. Fig. 7 displays the performance comparison of different HACIs. Fig. 8 shows the performance comparison between single RF and non-combined HACIs. Table 5 and Fig. 9 CCLPAe represents the weighted composite index, which integrates the most suitable index and all collaborative indexes. Table 6 and Fig. 10 show the performance comparison between the newly defined HACIs and the original SLPA indexes.
can be seen that the CCLPA framework proposed in this study is far better than the LR framework. Table 6 and Fig. 10 verify that the newly defined HACIs perform better than the original SLPA indexes, which shows that, from the perspective of heterogeneous AC, the hidden feature structure information in the sparse network can be extracted in depth, effectively overcoming the scale-free problem of recommending users in large circles to small circles.

B. ALGORITHMS PERFORMANCE COMPARISON
At the same time, Fig. 7 shows that the accuracy of the 19 algorithms from RA* to RTAl is significantly higher than that of the other 7 HACIs. Except for RA*, these best-performing HACIs are all newly proposed, which demonstrates that the proposed HACIs within and beyond the triadic closure structure, based on the principle of heterogeneous AC on the edges, can effectively overcome the network sparsity problem in predicting user friendships between various circles.

NETWORK SCALES
Although the overall performance of CCLPA has been confirmed in the previous section, it is still necessary to analyze the performance of CCLPA when recommending friends in circles of various scales and different node densities.
Accordingly, networks with 2 to 9 product circles in Twitter were selected. In addition, the HACIs with higher accuracy, namely RA* and RAA*, were chosen from Table 4 for comparison with CCLPA. Table 7 and Fig. 11 show the performance of CCLPA, RA* and RAA* in different networks. It can be observed in Table 7 that, compared with the other two algorithms, the average AUC value of CCLPA is the largest, which confirms that CCLPA is robust and performs well in networks with more or fewer product circles. Fig. 11 demonstrates that CCLPAa has high precision in both large and small circles.
However, when the circle is small, the prediction performance of RAA* is the worst. When the circle is medium-sized or the largest, the performance of RA* is the worst. In contrast, CCLPA makes good friend recommendations regardless of whether the circle is large or small.

VI. CONCLUSIONS AND FUTURE WORKS
In the early stage of sales, the marketing community established for a new product is generally small, so it is necessary to attract customers from the large mature product communities to the new one. In order to attract users accurately, it is essential to predict the friendships between the small circles formed for new products and the large circles formed for mature products.
Existing research on link prediction is usually based on the SLPA, which ignores that the AC on edges differs and cannot overcome the impact of the network sparsity and scale-free problems on link prediction accuracy between large and small circles. CCLPA is proposed to deeply extract the hidden feature structure information in the sparse network and to overcome the fluctuation of algorithm accuracy caused by the scale-free network through the self-adaptive construction of SCHACI. Compared with existing research, the distinctive aspects of this study are mainly reflected in the following. This research is only suitable for static network link prediction, and link prediction in dynamic networks, where links are generated and broken, needs further exploration. In the future, the prediction framework proposed in this study will be applied to dynamic networks, and efforts will be made to improve CCLPA so that the algorithm can predict both new and breaking links.