Community Detection in Social Networks Considering Social Behaviors

The study of community detection in networks has drawn great attention in recent years. To find communities and to understand community semantics, both network topology and network content are utilized. Unfortunately, none of them can explain the driving factors of generating community structure with semantics, which is significant for understanding the mechanisms of community generation. Our observations on a large number of networks show that specific user social behaviors are underlying factors for the generation of community structure. We exploit four types of social behaviors that widely exist in networks, i.e., reciprocity of interactions, posting preference, multitopic preference, and temporal variation of topics. We investigate their impacts on the formation process of links and content in networks, during which communities with topics form. Our analysis shows that they are highly related to community structure. Consequently, a generative community detection model SBCD (social behavior-based community detection) is proposed by combining network topology and content, in which the above social behaviors play a core role. The model is evaluated on two real datasets. The experimental results show that SBCD outperforms state-of-the-art baselines. Finally, a case study illustrates several significant observations with respect to the proposed social behaviors.


I. INTRODUCTION
Community detection is one of the most active research topics in network science [1], [2], [3], [4]. We can better understand networks and their functions by identifying community structure, which is an intrinsic characteristic of networks. A community is defined as a group of nodes that are densely connected with each other but sparsely connected with nodes from other groups [5]. A large number of community detection methods have been proposed [6], [7], [8], [9], [10], [11], [12], [13], [14]. Some of them utilize only network topology, while others integrate network content into their models. The adoption of network content (e.g., posts, tweets) makes the understanding of community semantics possible [15], [16], [17]. Topics that are discussed in a community are considered as community semantics. Due to homophily [18], people with similar attributes (topics of interest) communicate with each other more frequently and produce denser links, which generates community structure. However, attributes might be inconsistent with topology from the perspective of community structure. Therefore, the attributes of nodes should be processed carefully to improve community detection accuracy, e.g., by setting a balance value between topology and attributes to fit different structure-attribute correlations [19], [20], [21].
The associate editor coordinating the review of this manuscript and approving it for publication was Barbara Guidi.
Network topology and network content are the outcome of users' social behaviors. One of the elements of users' social behaviors is users' activities [22] (e.g., interactions between users and publishing posts). In social networks, e.g., Twitter, Facebook, and Reddit, users communicate with each other to express ideas by retweeting or commenting on tweets/posts. Relationships between users are generated as links and result in complex network topology. All texts posted by users make up network content that includes semantic information and reflects topics of interest for users. In the generation process of networks, community structure is generated as well.
VOLUME 10, 2022. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Considering the relations between network generation and social behaviors, an issue is raised regarding how users' social behaviors affect community structure and community semantics. Investigating the issue is critical to reveal the fundamental generation mechanism of community structure with semantics. To resolve this challenge, there are two key questions that need to be answered.
1) For each link between two users, what is the process of its formation with respect to community structure? Since network topology is the most important information based on which the communities are detected, we consider the formation process of all links. Suppose that user i publishes a tweet about some topics. Later, user j retweets it. Then, a directed link from user j to user i forms. The reasons for the generation of this link are related to two users' intents (i.e., user j is interested in the topics proposed by user i) and latent relationships of two users' attributes (i.e., they are in the same community or they have similar background attributes). In other words, the reasons not only relate to the community distributions of users but also to their topics of interest. Unfortunately, how to model link generation by considering community structure, community semantics and users' social behaviors has not yet been well addressed.
2) How does each user make contributions to the formation of network semantics? This is a critical element regarding the identification of community semantics [23], [24], [25], [26]. Network content is processed in association with network topology to detect community semantics. To answer this question, we need to consider the following two aspects. First, there are two types of content in networks, i.e., link content and node content. For example, a tweet that is not used to respond to others is node content, which means that the tweet is not on a link. A tweet with which a user replies to others is considered as link content. They play different roles in community detection because node content is the first step in creating a link. Therefore, separating link content and node content can accurately model network topology with community structure and network content, which further improves the accuracy of community detection. Moreover, considering the integrity of the dataset, even if a user has never interacted with others but only posted some posts, he should not be deleted from the dataset to better identify community semantics. Second, users in a network might publish posts or have discussions on multiple topics, i.e., users are interested in multiple topics [27]. Moreover, their topics of interest might change with time. How to accurately identify users' changing topics in community detection models is not well resolved.
Although many models have been proposed for community detection and have achieved great improvements, the above-stated questions have not been resolved completely when social behaviors are considered. Users' social behaviors include not only user activities but also social context (e.g., users' topic interests and posting habits). In this paper, we investigate four types of user social behaviors, which can help answer the above questions. They are stated as follows.
1) Reciprocity of interactions represents the tendency of generating a link between two users, e.g., from user i to user j. The probability of this tendency is highly connected with three factors: a) whether those two users belong to the same community, b) whether those two users are interested in the same topics [28], and c) whether target user j is popular or authoritative in a community. This social behavior indicates that users are more likely to communicate with each other when they are both in the same community and have the same topic preferences. They are less likely to have mutual relationships when they are in the same community but with different topic preferences. They are most unlikely to form links when they are neither in the same community nor have the same topic preferences. On the other hand, if user i is a popular user or a person of authority, the probability that others will reply to his or her activity/message is higher. Moreover, as people might have similar interests with people from outside the community, links exist not only inside communities but also between communities. Considering the reciprocity of interactions can explain the fundamental generation process of community structure. This social behavior can resolve the problems caused by the first question, i.e., how links are generated with respect to community structure and community semantics.
2) We take users' posting preference into consideration. Take Twitter as an example. When a user sends a tweet, he intends to express his idea on a certain topic. At this time, he cannot predict who will respond. On the other hand, when another user replies to or retweets the tweet with his own words, a link is generated. After analyzing a large number of real networks, we find that each user in a network has his or her own habits. As illustrated in Fig. 1, users can be classified into three categories. a) Users in the first category prefer publishing posts, and they seldom interact (reply or retweet) with others. Some of these posts are never commented on/replied to by others, which is a common case in social networks. Therefore, they do not make any contributions to the forming of links but make contributions only to the formation of network semantics. Some other posts might be commented on very actively by other users. These posts trigger link creation between users. b) Users in the second category like both publishing posts and interacting with others. These users contribute to network semantics with link content and post content. c) In the third category, users always interact with others, but they do not post documents. In this situation, all the contributions they make to the network semantics are link content. In conclusion, users' posting behaviors affect the formation of network semantics in different ways.
FIGURE 1. The generation process of a social network with community structure. Users are divided into three categories. a) Users in the first category prefer publishing posts, and they seldom interact with others; b) users in the second category like both publishing posts and interacting with others; c) users in the third category always interact with others but seldom publish posts. Suppose that there are three communities (i.e., com1, com2, and com3) and three topics (i.e., topic1, topic2, and topic3). Pie charts denote user-community distribution. Doughnut charts denote community-topic distribution. Each user in rectangles has his own community distribution (shown on top of a person). Users post two types of documents, i.e., isolated docs that are not replied to/forwarded by others and interactive docs that are replied to/forwarded by others. Fw:/Re: denotes replying to or forwarding posts of others with text content.
This behavior can resolve the first aspect of the second question, i.e., separating link content and node content.
3) Users always focus on multiple topics. Multitopic preference indicates that users in a network publish posts/tweets focusing on multiple topics. This social behavior has two effects on community structure. First, a user might interact with others who are not in his community, which creates intercommunity links. This increases the complexity of the community structure. Second, users' posts are fundamental factors in generating community semantics. Since a community is a set of users who share similar topics [29], users' interests in multiple topics result in multiple topic distributions in communities. Because an individual topic distribution for every user would introduce too many parameters into our model, given the large number of users in networks, we consider community-level topic distribution in this paper. This behavior can resolve the second aspect of the second question, i.e., users who are interested in multiple topics.
4) Users' topics of interest might change over time. We call this social behavior the temporal variation of topics. Individual topic changes lead to the creation of new topical content from a user and new links between users who were not connected previously, which results in changes in network content and network topology. Moreover, observing users' topic-changing patterns is significant when we focus on a specific user, e.g., a leader in a community. This behavior also addresses the second aspect of the second question, i.e., identifying users' changing topics in the community detection model.
In summary, user social behaviors are closely related to the generation of community structure and community semantics. The above four social behaviors give us clues to reveal the mechanism of the generation of community structure. Based on them, we propose a novel generative community detection model to generate network topology, node content and link content. Our contributions are summarized as follows:
• We validate that specific user social behaviors are highly related to the generation of communities. We propose four social behaviors that can accurately describe the inner relations of community structure, community semantics, network topology and network content.
• We propose a novel unified community detection model by integrating network topology, node content, link content and four types of user social behaviors.
• We conduct extensive experiments to verify our observations of the proposed social behaviors. The results show that our model achieves a more precise community structure than the baselines. Moreover, our case studies illustrate the existence of the proposed social behaviors.
The rest of the paper is organized as follows: Section II reviews related work; Section III describes the details of our model; Section IV describes the model inference; Section V presents the experiments and results; finally, Section VI concludes the paper and presents future work.

II. RELATED WORK
In this section, related work on community detection is reviewed. We discuss the main contributions of these studies and present the differences between them and our model.
Many community detection methods have been proposed in recent years, and they have been reviewed in survey articles: [9], [11], [12], [13], which show great interest and a large body of knowledge in this field. Some of them utilize topological information of networks [30], [31]. The goal of these studies is to divide nodes into groups, i.e., communities, without considering any semantic information [32], [33].
Today's social networks exhibit rich information, e.g., user profiles and post content. Utilizing such content to identify semantic information has become a hot topic [34], [35], [36], [37], [38], [39]. The survey in [20] studies the attributed community detection task. The work of [40] presents a comparative study of existing attributed network community detection algorithms on both synthetic and real-world datasets. A model that generates synthetic node-attributed graphs with planted communities is proposed in [41]. The work of [21] proposes a unified weight-based attributed community detection model. Network content reflects the topics in which individuals are interested. According to the homophily principle [18], nodes with similar topics of interest are more likely to communicate with each other, which can help improve community detection accuracy [42], [43]. Therefore, an increasing number of studies integrate network topology and network content in community detection models.
To protect the privacy of users, the storage of social data is decentralized, which leads to the definition of Decentralized Online Social Networks (DOSNs) [44]. Several dynamic community detection methods have been proposed in DOSNs [45], [46], [47].
Beyond community structure and community semantics, recent studies have begun to investigate community-level diffusion, i.e., modeling the diffusion patterns of topics across different communities [48] and community profiling [49]. The work in [49] characterizes the intrinsic nature and extrinsic behavior of a community, i.e., community profiling. It considers the heterogeneity among user links (i.e., friendship links and diffusion links). If two users interact with each other to diffuse information, then diffusion links are formed (e.g., a retweet is considered as a diffusion link in Twitter). Both [48] and [49] propose a unified latent framework in which node content and links are both generated by the same latent variables (e.g., community distribution variable, topic distribution variable and word distribution variable). In addition to community detection, they efficiently identify temporal topics within communities, diffusion of communities [48] or community profiling [49].
Investigating social behaviors has been a hot topic in social network research. Social behaviors include the actions of users, e.g., reposting [50], commenting on other users' posts, or following others. In [51], the authors investigate the emotional contagion between nodes by considering community structure. The work of [52] classifies users into three roles, i.e., opinion leader, structural hole spanner, and ordinary user. In [53], the authors propose a community detection model based on NMF, in which nodes are considered in the roles of hubs (also known as leaders) or outliers. The work of [54] models user behavior knowledge from different networks to alleviate the data sparsity problem. The work of [55] investigates retweeting behavior on Twitter and proposes a factor graph model to predict it. The impact of users' availability on online ego networks is analyzed in [56]. Communities are detected based on user activity and social ties in [57].
However, existing studies do not consider the underlying factors for community generation. Our model differs from the above models: we use four types of social behaviors to generate all posts and links, and we show that considering them indeed improves the outcomes of community detection tasks.

III. SOCIAL BEHAVIOR-BASED COMMUNITY DETECTION
In this section, we first formulate the problem of integrating social behaviors, network topology and network content for community detection. Then, we propose a model, namely, social behavior-based community detection (SBCD). Finally, we describe the generative process of the proposed model.
The notations used in our model are shown in Table 1.

A. PROBLEM FORMULATION
In this section, we state the definitions in our model and summarize the formulation of the problem we solved.
Definition 1: A network consists of a node set $U$, an edge set $E$, and a network content set $D$. $e_{ii'}$ is a directed link from node $i$ to node $i'$.
Definition 2: Community membership is associated with each user and defined by a vector $\pi$. Its dimension equals the number of communities. For a user $i$, $\pi_{ic}$ is the probability of belonging to community $c$. A user in a network may belong to multiple communities, i.e., the user is assigned to all communities with different probabilities. We apply a threshold to these probabilities to obtain the user's actual communities. Therefore, our model can detect overlapping communities.
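The thresholding step in Definition 2 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name is hypothetical, the default threshold of 1/|C| follows the experimental setting reported later in the paper, and treating a probability equal to the threshold as a member is an assumption.

```python
def overlapping_communities(membership, threshold=None):
    """Return the indices of communities a user is assigned to.

    membership: list of probabilities, one per community (the vector pi_i).
    threshold:  cut-off probability; defaults to 1/|C| (assumed default).
    """
    if threshold is None:
        threshold = 1.0 / len(membership)
    # Keep every community whose probability reaches the threshold
    # (using >= rather than > is an assumption of this sketch).
    return [c for c, p in enumerate(membership) if p >= threshold]


# Hypothetical membership vector over three communities.
pi_i = [0.50, 0.40, 0.10]
print(overlapping_communities(pi_i))  # [0, 1]
```

A user whose mass is split between two communities is assigned to both, which is how overlapping structure arises from a single membership vector.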
Definition 3: A topic $k \in K$ is defined as a multinomial distribution over the vocabulary. For a word $w$, $\phi_{kw}$ is the probability of belonging to topic $k$. To investigate community topics, we need to process all documents (e.g., posts in a forum network or papers in a citation network). Given a dataset, we fix its vocabulary with $|W|$ words. A topic is denoted by a set of words. Each word belongs to a topic with a probability $\phi_{kw}$.
Definition 4: Community-topic distribution is defined by a multinomial distribution over topics. For a community $c$, $\theta_{ck}$ represents the probability of focusing on topic $k$. A community is a set of users with dense connections. Users in a community might talk about several topics. These topics are used as the semantics of the communities. Therefore, each community focuses on multiple topics.
Definition 5: The time stamp distribution of a user $i$ focusing on topic $k$ is defined by a multinomial distribution $\psi_{ik}$ over time stamps. Its dimension is the number of time stamps.
All posts and links are associated with a time stamp in our model. A user's topics of interest might change at different time stamps. Therefore, each topic follows a probability distribution over time stamps.
Definition 6: The user popularity distribution of community $c$ is defined by a multinomial distribution over all users. In community $c$, $\gamma_{ci}$ represents the probability that other users interact with user $i$.
Definition 7: Topic correlation $\eta_{gy,g'y'}$ defines the correlation of topics $y$ and $y'$ in communities $g$ and $g'$, respectively. Suppose that user $i$ is in community $g$ and focuses on topic $y$, while user $i'$ is in community $g'$ and focuses on topic $y'$. $\eta_{gy,g'y'}$ means that user $i$ and user $i'$ are likely to communicate with each other when they are in the same community and are interested in the same topic. If $g$ equals $g'$ while $y$ does not equal $y'$, then $\eta_{gy,g'y'}$ is smaller: although user $i$ and user $i'$ are in the same community, they are interested in different topics, so the probability of generating a link between them is smaller. For the last two situations (i.e., $g$ does not equal $g'$ and $y$ equals $y'$; $g$ does not equal $g'$ and $y$ does not equal $y'$), $\eta_{gy,g'y'}$ should be the smallest value, which means that they are unlikely to communicate with each other if they are not in the same community, regardless of whether they focus on the same topics.
To summarize our model, we formulate the problem we solved as follows: given a graph with rich content, we want to derive each user's community distribution and temporal topic distribution. For communities, we want to identify the topic distribution based on the user's social behaviors, network topology and network content.

B. MODEL STRUCTURE
In this section, we first describe our model structure. Then, we explain its three components in detail to show how the above definitions are implemented to solve the problem we defined.
We propose a generative model to accurately generate network topology, link content and node content by utilizing social behaviors. In this model, (i) user community membership, (ii) the temporal topic distribution of users, (iii) community-topic distribution, and (iv) the word distribution of topics are all latent factors. We want to infer them given network topology and content. Fig. 2 shows the probabilistic graphical model of SBCD. It consists of three components: a) generation of node content with time stamps; b) generation of link content with time stamps; and c) generation of all links.

1) GENERATION OF NODE CONTENT WITH TIME STAMPS
Considering the second social behavior, i.e., posting preference, we generate only node content in this component. For example, if a user $i$ publishes a tweet that is never replied to/retweeted by others, the tweet is considered as node content. All tweets of user $i$ are generated as follows. Based on user $i$'s community membership distribution $\pi_i$, we sample a community indicator $c_{ij}$ indicating that user $i$ belongs to community $c_{ij}$ when he publishes the $j$-th tweet. Then, we sample a value of another latent variable, denoted by $z_{ij}$, based on the community-topic distribution $\theta_{c_{ij}}$, which means that the topic of the current tweet is $z_{ij}$. Finally, we generate two observed variables based on the topic-word distribution $\phi_{z_{ij}}$ and the topic distribution over time $\psi_{iz_{ij}}$: the word list and the time stamp of the tweet. Here, we utilize the multitopic preference and temporal variation of topics behaviors.
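The sampling chain described above (community, then topic, then words and time stamp) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the function names are hypothetical, distributions are plain Python lists, and words and time stamps are represented by integer indices.

```python
import random


def sample_index(probs, rng):
    """Draw one index from a discrete distribution given as a list of weights."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_post(pi_i, theta, phi, psi_i, n_words, rng):
    """Generate one isolated post (node content) for a single user.

    pi_i:  community membership of the user, length |C|
    theta: community-topic distributions, |C| x |K|
    phi:   topic-word distributions, |K| x |W|
    psi_i: the user's topic distributions over time stamps, |K| x |T|
    """
    c = sample_index(pi_i, rng)        # community indicator c_ij
    z = sample_index(theta[c], rng)    # topic indicator z_ij
    words = [sample_index(phi[z], rng) for _ in range(n_words)]  # word list
    t = sample_index(psi_i[z], rng)    # time stamp of the post
    return c, z, words, t


# Toy, hand-picked distributions (assumptions for illustration only).
rng = random.Random(0)
pi_i = [0.7, 0.3]
theta = [[0.9, 0.1], [0.2, 0.8]]
phi = [[0.5, 0.4, 0.1], [0.1, 0.2, 0.7]]
psi_i = [[0.6, 0.4], [0.3, 0.7]]
post = generate_post(pi_i, theta, phi, psi_i, n_words=5, rng=rng)
```

Link content follows the same chain; only the indicators are named differently in the model.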

2) GENERATION OF LINK CONTENT WITH TIME STAMPS
Considering the social behavior of posting preference, we generate only link content in this component. Those tweets that are published to reply to others are considered link content. They are generated in the same way as node content. For user $i$'s $q$-th link, we derive its community indicator $g_{iq}$ indicating that user $i$ belongs to community $g_{iq}$ when he sends this post to reply to someone. Then, we sample a value of another latent variable, denoted by $y_{iq}$, based on the community-topic distribution $\theta_{g_{iq}}$. This means that the topic of the current link content is $y_{iq}$. Finally, we generate two observed variables based on the topic-word distribution $\phi_{y_{iq}}$ and the topic distribution over time $\psi_{iy_{iq}}$: the word list and the time stamp of the current link content.

3) GENERATION OF ALL LINKS
For a directed link $e_{ii'}$ from $i$ to $i'$, its generation is formulated as follows. $\eta_{gy,g'y'}$ is a factor in generating $e_{ii'}$ because it represents topic correlation with respect to communities. Moreover, the social behavior of reciprocity of interactions indicates that users tend to focus on popular or authoritative users. To integrate the above two factors, we define a score $\omega_{ii'}$ that combines them, where y is the community indicator of user i and $\lambda_i$ denotes the characteristic (i.e., active or inactive) of user $i$. $\lambda_i$ is calculated by $\lambda_i = (\text{out-degree})_i / (\text{degree})_i$. Then, a sigmoid function is used to calculate the probability of this link.
FIGURE 2. The graphical representation of our model.
Because we use the Gibbs sampling method for inference, which makes it difficult to process the sigmoid function, we adopt the Pólya-Gamma distribution to model the sigmoid function [58].
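To make the link component concrete, the sketch below computes the activity ratio of a user and a link probability. It is a hedged illustration only: the score is assumed to be the sum of a topic-correlation value and a popularity value, as in the Bernoulli step of the generative process, and the way $\lambda_i$ enters the score is not modeled here. All function names and numeric values are hypothetical.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def activity_ratio(out_degree, degree):
    """lambda_i: share of a user's links that are outgoing (0 if no links)."""
    return out_degree / degree if degree else 0.0


def link_probability(eta_value, gamma_value):
    """Probability of a link, assuming the score is eta + gamma."""
    return sigmoid(eta_value + gamma_value)


# Hypothetical values that respect the ordering in Definition 7:
# same community and topic > same community only > different communities.
eta_same_both, eta_same_comm, eta_other = 2.0, 0.5, -2.0
popularity = 0.1
probs = [link_probability(e, popularity)
         for e in (eta_same_both, eta_same_comm, eta_other)]
```

Because the sigmoid is monotonic, the ordering of the correlation values carries over directly to the ordering of link probabilities.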

C. GENERATIVE PROCESS
Summarizing the above components, the generative process of SBCD is described as follows.
1) Initialize $\eta$ randomly;
2) For each topic $k \in K$,
   a) Sample word distribution from a Dirichlet prior: $\phi_k \mid \beta \sim \mathrm{Dir}(\beta)$;
3) For each community $c \in C$,
   a) Sample distribution over topics from a Dirichlet prior: $\theta_c \mid \alpha \sim \mathrm{Dir}(\alpha)$;
   b) Initialize $\gamma_c$ with a Uniform distribution;
4) For each user $i \in U$,
   a) Sample his community distribution from a Dirichlet prior: $\pi_i \mid \rho \sim \mathrm{Dir}(\rho)$;
   b) For each topic $k \in K$,
      i) Sample distribution over time stamps from a Dirichlet prior: $\psi_{ik} \mid \varepsilon \sim \mathrm{Dir}(\varepsilon)$;
5) For each user $i \in U$,
   a) For each post $j \in D_i$,
      i) Sample community indicator from a Multinomial distribution: $c_{ij} \mid \pi_i \sim \mathrm{Mul}(\pi_i)$;
      ii) Sample topic indicator from a Multinomial distribution: $z_{ij} \mid \theta_{c_{ij}} \sim \mathrm{Mul}(\theta_{c_{ij}})$;
      iii) For each word $l \in W_{ij}$,
         • Sample word from a Multinomial distribution: $w_{ijl} \mid \phi_{z_{ij}} \sim \mathrm{Mul}(\phi_{z_{ij}})$;
      iv) Sample time stamp $t_{ij} \mid \psi_{iz_{ij}} \sim \mathrm{Mul}(\psi_{iz_{ij}})$;
   b) For each link $q \in E_i$,
      i) Sample community indicator from a Multinomial distribution: $g_{iq} \mid \pi_i \sim \mathrm{Mul}(\pi_i)$;
      ii) Sample topic indicator from a Multinomial distribution: $y_{iq} \mid \theta_{g_{iq}} \sim \mathrm{Mul}(\theta_{g_{iq}})$;
      iii) Sample the link from $i$ to $i'$: $E^{t}_{ii'} \mid g_{iq}, g_{i'}, y_{iq}, y_{i'}, \eta, \gamma \sim \mathrm{Ber}(\sigma(\eta_{g_{iq}y_{iq},\,g_{i'}y_{i'}} + \gamma_{i'g_{i'}}))$;
      iv) For each word $r \in W_{iq}$,
         • Sample word from a Multinomial distribution: $w_{iqr} \mid \phi_{y_{iq}} \sim \mathrm{Mul}(\phi_{y_{iq}})$;
      v) Sample time stamp $t_{iq} \mid \psi_{iy_{iq}} \sim \mathrm{Mul}(\psi_{iy_{iq}})$;
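Steps 2) to 4) of the process above draw every latent distribution from a symmetric Dirichlet prior. A minimal standard-library sketch of that initialization is given below; it is not the authors' code. The function names, the dictionary layout, and the toy sizes and concentration values are assumptions, and the Dirichlet draw uses the usual normalized-Gamma construction.

```python
import random


def sample_dirichlet(concentration, dim, rng):
    """Draw one sample from a symmetric Dirichlet via normalized Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(dim)]
    total = sum(draws)
    if total == 0.0:
        # Very small concentrations can underflow to all zeros; fall back to
        # a one-hot vector, the limiting shape of a very sparse Dirichlet.
        hot = rng.randrange(dim)
        return [1.0 if d == hot else 0.0 for d in range(dim)]
    return [d / total for d in draws]


def init_parameters(n_users, n_comms, n_topics, vocab_size, n_stamps,
                    rho, alpha, beta, eps, rng):
    """Sample phi, theta, pi, psi from their priors; gamma starts uniform."""
    phi = [sample_dirichlet(beta, vocab_size, rng) for _ in range(n_topics)]
    theta = [sample_dirichlet(alpha, n_topics, rng) for _ in range(n_comms)]
    gamma = [[1.0 / n_users] * n_users for _ in range(n_comms)]
    pi = [sample_dirichlet(rho, n_comms, rng) for _ in range(n_users)]
    psi = [[sample_dirichlet(eps, n_stamps, rng) for _ in range(n_topics)]
           for _ in range(n_users)]
    return {"phi": phi, "theta": theta, "gamma": gamma, "pi": pi, "psi": psi}


# Toy sizes and concentrations chosen for illustration only.
params = init_parameters(n_users=4, n_comms=3, n_topics=3, vocab_size=10,
                         n_stamps=7, rho=0.5, alpha=0.5, beta=0.5, eps=0.5,
                         rng=random.Random(0))
```

Small concentration values produce sparse vectors, so each user concentrates on few communities and each community on few topics.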
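Second, we calculate all integrals in (6). The first integral is calculated by (7).

$$\int P(\pi \mid \rho)\, P(c, g \mid \pi)\, d\pi = \prod_{i \in U} \frac{\Gamma(|C|\rho)}{\Gamma(\rho)^{|C|}} \cdot \frac{\prod_{c \in C} \Gamma\left(n_i^{(c)} + \rho\right)}{\Gamma\left(n_i^{(\cdot)} + |C|\rho\right)} \tag{7}$$

where $n_i^{(c)}$ is the number of posts and links that are assigned to community $c$ for user $i$, and $n_i^{(\cdot)}$ is the number of posts and links of user $i$ that are assigned to all communities.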
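As a numerical check on the first integral, the sketch below evaluates its logarithm from per-user community counts. It assumes the standard closed form of a Dirichlet-multinomial integral with a symmetric prior $\mathrm{Dir}(\rho)$, which may differ in presentation from the paper's equation (7); the function name and the example counts are hypothetical.

```python
import math


def log_community_integral(counts_per_user, rho):
    """Log of the integral of P(pi | rho) P(c, g | pi) over pi.

    counts_per_user[i][c] is the number of posts and links of user i
    assigned to community c. Assumes a symmetric Dirichlet prior Dir(rho).
    """
    total = 0.0
    for counts in counts_per_user:
        n_comms = len(counts)
        # Normalizing constant of the prior: Gamma(|C| rho) / Gamma(rho)^|C|
        total += math.lgamma(n_comms * rho) - n_comms * math.lgamma(rho)
        # Numerator: product over communities of Gamma(n_i^(c) + rho)
        total += sum(math.lgamma(n + rho) for n in counts)
        # Denominator: Gamma(n_i^(.) + |C| rho)
        total -= math.lgamma(sum(counts) + n_comms * rho)
    return total


# One user, two communities, a single item assigned to community 0.
# With rho = 1 the prior is uniform, so the probability is 1/2.
value = log_community_integral([[1, 0]], rho=1.0)
```

Working in log space with `math.lgamma` avoids overflow of the Gamma function for users with many posts and links.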
The second integral is calculated from the community-topic counts, where $n_{Dc}^{(k)}$ is the number of all posts that are assigned to community $c$ with topic $k$, and totals such as $n_{Ec}^{(\cdot)}$ integrate all topics. The third integral is calculated from the topic-word counts, where $n_{Dk}^{(w)}$ is the number of times word $w$ is assigned to topic $k$ in all posts.

B. PARAMETER ESTIMATION
After the Gibbs sampler converges, the parameters $\hat{\pi}$, $\hat{\theta}$, $\hat{\phi}$, and $\hat{\psi}$ are estimated by (16)-(19). Parameter $\eta$ is calculated by aggregating all community-topic pairs with respect to all links. Parameter $\gamma$ for each community is calculated by counting the number of links whose target node is in the current community.
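The counting rule for $\gamma$ can be sketched as below. This is an assumption-laden illustration rather than the authors' procedure: links are represented as hypothetical `(target_user, target_community)` pairs, counts are normalized per community so that each row is a distribution over users (as in Definition 6), and a community that receives no links keeps the uniform initialization.

```python
def estimate_gamma(link_targets, n_users, n_comms):
    """Estimate the per-community user popularity from link targets.

    link_targets: iterable of (target_user, target_community) pairs,
                  one per link, using integer indices.
    Returns a list of n_comms rows, each a distribution over n_users.
    """
    counts = [[0] * n_users for _ in range(n_comms)]
    for target_user, target_community in link_targets:
        counts[target_community][target_user] += 1

    gamma = []
    for row in counts:
        total = sum(row)
        if total == 0:
            # No link targets this community: keep the uniform start.
            gamma.append([1.0 / n_users] * n_users)
        else:
            gamma.append([n / total for n in row])
    return gamma


# Three links into community 0: two target user 0, one targets user 1.
gamma = estimate_gamma([(0, 0), (0, 0), (1, 0)], n_users=2, n_comms=2)
```

In this toy case user 0 receives two thirds of the popularity mass in community 0, while community 1 stays uniform.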

C. ALGORITHM SUMMARIZATION AND TIME COMPLEXITY
The inference procedure is shown in Algorithm 1. Next, we analyze the time complexity of the algorithm of SBCD. As shown later, it runs linearly in terms of network data (i.e., the number of users, links, topics, vocabulary).

Algorithm 1 Inference for SBCD
Require: users $U$, user posts $D$ with time stamps, links $E$ with content and time stamps;
Ensure: community distribution $\pi$, topic distribution $\theta$, word distribution $\phi$, topic distribution over time $\psi$, parameter $\eta$, parameter $\gamma$;
1: Initialize $\alpha$, $\beta$, $\varepsilon$, $\rho$, $\eta$, $\gamma$;
2: for iter = 1 : $T$ do
3:   for each user $i \in U$ do
4:     for each post $d_{ij} \in D_i$ do
5:       Sample community indicator $c_{ij}$ according to (11);
6:       Sample topic indicator $z_{ij}$ according to (12);
7:     end for
8:     for each link $e_{ii'} \in E_i$ do
9:       Sample community indicator $g_{iq}$ according to (13);
10:      Sample topic indicator $y_{iq}$ according to (14);
11:      Sample $\xi_{ii'}$ according to (15);
12:    end for
13:  end for
14:  for each link $e \in E$ do
15:    Update $\eta$ and $\gamma$ by aggregating community and topic of two endpoint users;
16:  end for
17: end for
18: Calculate $\hat{\pi}$, $\hat{\theta}$, $\hat{\phi}$, and $\hat{\psi}$ according to (16)-(19);

$T$ denotes the number of iterations for convergence. All counters (e.g., how many times a user is assigned to a community) are recorded in memory. In Steps 3-7, community indicators and topic indicators for posts of all users are sampled. Step 5 takes constant time. In Step 6, it takes $O(|W|)$ to compute the second fraction of (12) for a specific topic, where $|W|$ is the vocabulary size. Thus, Steps 3-7 take $O(|U| \times |D| \times |C| + |U| \times |D| \times |K| \times |W|)$. Steps 8-12 compute the community indicator, topic indicator and $\xi$. The number of all links is $|E|$. Equations (13), (14) and (15) take constant time. Thus, Steps 8-12 take $O(|E| \times |C| + |E| \times |K| \times |W|)$. Steps 14-16 calculate $\eta$ and $\gamma$, which takes $O(|E|)$. Based on the above discussion, the complexity of SBCD is linear in the data size. As datasets become large, we can parallelize our model.

V. EXPERIMENTS
In this section, we evaluate our model's accuracy for community detection and show a case study to illustrate the social behaviors that are investigated. We choose two real datasets and six state-of-the-art baselines. Our experiments are conducted on a personal computer with an Intel Core i7-7700K @ 4.2 GHz CPU and 64 GB RAM.

A. DATASETS
To evaluate the accuracy of the community detection results, we choose two real networks that include all four types of social behaviors. One of them is the Reddit dataset, and the other is the DBLP dataset [60].

1) REDDIT DATASET
Posts are crawled from three sub-forums of reddit.com (i.e., Science, movie, and Politics). These three sub-forums correspond to three communities. The author of each post (including reply comments) is extracted as a node. The sub-forum to which a post belongs is taken as the ground-truth community of its author. Therefore, the ground truth depends only on where a node actually appears. Moreover, when a node appears in multiple sub-forums, it belongs to multiple communities, and these communities overlap. When a user replies to a post of another user, a directed link is generated. Main threads are considered node content, and posts that reply to other posts are used as link content. All posts are recorded with time stamps. We divide the dataset into seven snapshots with one day as the time window.
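Dividing time-stamped posts into snapshots with a fixed window can be sketched as follows. The function name and the dates are hypothetical, and the sketch assumes windows are counted from the earliest time stamp in the crawl; it covers fixed-length windows such as one day, not calendar years.

```python
from datetime import datetime, timedelta


def snapshot_index(timestamp, start, window=timedelta(days=1)):
    """Return the zero-based snapshot a time stamp falls into.

    start is the beginning of the first window (assumed to be the
    earliest time stamp in the dataset).
    """
    return (timestamp - start) // window


# Hypothetical crawl start; seven one-day windows give indices 0..6.
start = datetime(2020, 1, 1)
posts = [datetime(2020, 1, 1, 23, 59), datetime(2020, 1, 3), datetime(2020, 1, 7, 12)]
indices = [snapshot_index(p, start) for p in posts]  # [0, 2, 6]
```

Floor division of two `timedelta` values yields an integer, so no manual conversion to seconds is needed.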

2) DBLP DATASET
It is a paper coauthorship network. The papers are crawled from three research fields, i.e., Machine Learning, Image Processing, and Data Mining, corresponding to three communities. The authors are considered as nodes. When several authors publish a paper jointly, links are created among them. The research fields are considered the ground-truth communities of the authors. When an author has multiple research fields, the communities overlap. Therefore, the ground truth reflects the authors' actual research fields. The title of a paper is used as both node content and link content. The dataset is divided into eleven snapshots with one year as the time window.
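The rule that coauthors of a paper are linked to each other can be sketched as below. The function name and author labels are hypothetical, and representing each coauthor pair as a single undirected link is an assumption, since the text does not state the direction of these links.

```python
from itertools import combinations


def coauthor_links(papers):
    """Build the set of coauthor pairs from a list of author lists.

    Each pair is stored once, with the two names in sorted order,
    so repeated collaborations do not create duplicate links.
    """
    links = set()
    for authors in papers:
        for a, b in combinations(sorted(set(authors)), 2):
            links.add((a, b))
    return links


# Two hypothetical papers; the second repeats an existing collaboration.
links = coauthor_links([["A", "B", "C"], ["B", "A"]])
```

A paper with n distinct authors contributes n(n-1)/2 pairs, so single-author papers add nodes but no links.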
All datasets are processed by removing stop words and stemming. They are summarized in Table 2.

B. BASELINES
Our model integrates network topology, network content, and user social behaviors for community detection. Therefore, we choose six similar state-of-the-art baselines to evaluate our model's accuracy. Some of them model some user attributes (e.g., documents, roles) to detect the community structure of networks. They are all generative models. The six baselines are described as follows:
• Community Level Diffusion (COLD) [48]. It is a generative model that generates network topology and content based on the latent community membership factor. It can identify the diffusion pattern between communities.
• Community Profiling and Detection (CPD) [49]. It integrates friendship relations, diffusion links and individual preferences to identify community profiling. This is the first work to propose community profiling.
• PMTLM. It combines the LDA model and the Poisson distribution to generate network topology and content. It generates documents and links based on the same latent factor. Because it detects the community structure of a document network, we need to integrate the documents of each user to infer his or her community membership.
• Community Role Model (CRM) [52]. It assigns roles to users. Friendship links and diffusion links are modeled in networks based on users' community assignment.
• Community Detection Considering Group Homophily and Individual Personality of Topics (GHIPT) [62]. It proposes a novel generative community detection model by integrating group homophily and individual personality of topics.
• Topic Correlations-Based Community Detection (TCCD) [63]. TCCD was proposed in our previous work, which this paper extends. It considers the correlations of different topics in the community detection model.

C. METRICS
The output of our method for each node is a probability distribution over communities. We set a threshold (e.g., 0.33 for three communities) to obtain the community labels of nodes, which yields an overlapping community structure. Since both datasets supply ground truth, we use generalized normalized mutual information (GNMI), the F score, and the Jaccard index as metrics to evaluate the accuracy of the community detection results. For a dataset with ground truth, the GNMI measure is used to evaluate the accuracy of overlapping community structures [64]. The F score is the harmonic mean of precision and recall:

$$F = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$

The community number and topic number are set to their true values according to the ground truth. η is initialized with random values. Existing work has shown that Dirichlet hyperparameters have a low impact on the efficiency of generative models such as ours. Moreover, empirical studies also show that our model is insensitive to hyperparameters. Therefore, all Dirichlet hyperparameters are initialized with fixed values according to the common strategy (i.e., ρ = 0.01, α = 0.001, β = 0.1, ε = 0.001) [48], [65], [66]. We set the threshold for determining community memberships to 1/|C|. All parameters of the baselines are set to the values recommended by their authors.

E. COMPARISON WITH BASELINES
Table 3 shows the comparisons between SBCD and the baselines on the two datasets. Overall, SBCD outperforms all baselines on all metrics. On the Reddit dataset, there are 3,925 isolated posts, i.e., node content. They are not on links and therefore do not contribute to the generation of links. Our model separates these isolated posts from link posts, so that they do not participate in the formation of network topology. COLD uses all posts to generate links and does not distinguish link content from node content. PMTLM must eliminate isolated posts, or an error will occur. CPD does not consider link content at all, which means that all posts are used as node content and provide no useful information for accurately generating network topology. The results show that SBCD achieves 1.92%, 0.77%, and 0.51% improvements in terms of GNMI, F score, and Jaccard, respectively, over the second-best baseline, i.e., GHIPT. Compared with our previous work, TCCD, published in AAAI, the results are improved by 4.93%, 1.47%, and 1.23% in terms of GNMI, F score, and Jaccard, respectively.
On the DBLP dataset, the situation is quite different from that on Reddit. The title of a paper is used by all its authors as node content. Meanwhile, it is also used as the content of the link from one author to another. Therefore, except when a paper has only one author, which is very uncommon, node content is also used as link content. The results show that SBCD achieves 4.79%, 2.64%, and 4.73% improvements in terms of GNMI, F score, and Jaccard, respectively, over TCCD. It achieves a 1.92% improvement in terms of GNMI over the second-best baseline, i.e., GHIPT. The experimental results show that considering social behaviors enables SBCD to achieve better results. The results are summarized as follows.
1) Although all baselines utilize topology and network content, our model derives a more accurate community structure by considering underlying factors, i.e., social behaviors that lead to the generation of communities. The results show that the social behaviors we proposed have profound impacts on community structure.
2) Moreover, for those networks without much isolated node content, such as paper coauthorship networks, SBCD outperforms all baselines for all metrics.
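The evaluation procedure described in the metrics subsection can be sketched as follows: membership probabilities are thresholded at 1/|C| to obtain overlapping labels, and a detected community is then scored against a ground-truth community. This is a minimal sketch under stated assumptions. The function names are hypothetical, the comparison uses `>=` by assumption, and the text does not specify how detected and ground-truth communities are matched or averaged, so only the per-community set-based F score and Jaccard index are shown. GNMI is omitted.

```python
def assign_communities(membership, num_communities):
    """Threshold each node's community distribution at 1/|C| (overlap allowed)."""
    threshold = 1.0 / num_communities
    return {
        node: {c for c, p in enumerate(probs) if p >= threshold}  # '>=' is an assumption
        for node, probs in membership.items()
    }


def members_of(labels, community):
    """Set of nodes assigned to the given community."""
    return {node for node, comms in labels.items() if community in comms}


def f_score(detected, truth):
    """Harmonic mean of precision and recall between two node sets."""
    overlap = len(detected & truth)
    if overlap == 0:
        return 0.0
    precision = overlap / len(detected)
    recall = overlap / len(truth)
    return 2 * precision * recall / (precision + recall)


def jaccard(detected, truth):
    """Size of the intersection divided by the size of the union."""
    union = detected | truth
    return len(detected & truth) / len(union) if union else 0.0


# Hypothetical membership distributions over three communities.
membership = {
    "u1": [0.70, 0.20, 0.10],
    "u2": [0.45, 0.45, 0.10],
    "u3": [0.10, 0.10, 0.80],
}
labels = assign_communities(membership, num_communities=3)
detected0 = members_of(labels, 0)
truth0 = {"u1", "u3"}  # hypothetical ground-truth community
print(labels["u2"], f_score(detected0, truth0), jaccard(detected0, truth0))
```

Node `u2` exceeds the threshold for two communities, which is how overlapping memberships arise from a single probability distribution.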

F. CASE STUDY
In addition to community structure, we are also interested in community semantics, i.e., the topic distribution of communities, topic correlations between communities and word distribution of topics. Moreover, by considering social behaviors, we further illustrate the following significant information of users on both datasets: user popularity and topic changes of users.

1) TOPIC DISTRIBUTION OF COMMUNITIES
As Fig. 3 shows, the topics movie and Politics are dominant in the communities movie and Politics, respectively. However, for the community Science, although the topic Science is dominant, 35 percent of posts discuss Politics and 16 percent of posts discuss movie. Fig. 4 shows that the topics Data Mining, Image Processing, and Machine Learning are dominant in the communities Data Mining, Image Processing, and Machine Learning, respectively. Compared with the Reddit dataset, a research field here does involve multiple topics, as studies become increasingly cross-disciplinary.

Fig. 6 shows topic correlations on the DBLP dataset. Fig. 6(a), Fig. 6(b), and Fig. 6(c) illustrate topic correlations inside the communities of Data Mining, Image Processing, and Machine Learning, respectively. The topics Data Mining, Image Processing, and Machine Learning are the dominant topics in the corresponding communities. Moreover, Machine Learning is a hot research topic in all communities. Fig. 6(d) and Fig. 6(e) show the topic correlations across the three communities. They reveal a real phenomenon: authors in the Image Processing and Data Mining research fields communicate intensively with authors in the Machine Learning field.

3) WORD DISTRIBUTION OF TOPICS
We use word clouds to illustrate topics. As shown in Fig. 7 and Fig. 8, all topics identified by SBCD are meaningful.

4) USER POPULARITY
As described in the first social behavior, users with a large influence often attract more attention. This case study verifies the ''user popularity'' social behavior by analyzing the parameter γ, which denotes the popularity of users in each community and is responsible for the generation of links. If a node is popular or authoritative, a large number of links will point to it. Therefore, it is the combination of a node's popularity and the content it publishes that affects community structure and community semantics.
The top 5 authors in each community on the DBLP dataset are selected, as shown in Table 4. Our manual analysis shows that all 5 authors are the most influential researchers in the corresponding research field. Because it is difficult to evaluate the true popularity of users on Reddit, we do not show this case on Reddit.
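The selection of top authors can be sketched as a ranking over the learned popularity parameter. In the minimal sketch below, `gamma` is a hypothetical nested mapping from community to user to popularity value. The numbers and names are invented for illustration and are not results from Table 4.

```python
def top_k_users(gamma, k=5):
    """Rank users in each community by popularity, highest first.

    gamma: dict mapping community -> {user: popularity}; values are hypothetical.
    Ties are broken alphabetically so that the ranking is deterministic.
    """
    return {
        community: sorted(scores, key=lambda user: (-scores[user], user))[:k]
        for community, scores in gamma.items()
    }


gamma = {
    "Data Mining": {"a1": 0.30, "a2": 0.05, "a3": 0.40, "a4": 0.25},
    "Machine Learning": {"a1": 0.10, "a5": 0.60, "a6": 0.30},
}
ranking = top_k_users(gamma, k=2)
```

Because γ is defined per community, the same user can rank differently in different communities, as `a1` does here.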

5) TEMPORAL VARIATION OF USER TOPICS
Users in a community share similar topics, which generates community topics. However, the topics that interest users change over time, which is modeled by the fourth social behavior. This case study investigates topic changes at the individual level.
Due to the large number of users, Fig. 9 shows only the topic changes of the first author in the community Image Processing. We can track the change in his research topics, which is significant for better understanding the change in the leading research direction. The DBLP dataset does not include his publications at time stamp 2 and time stamp 4. Fig. 9 shows that his studies on Image Processing and Machine Learning are stable except at these time stamps, and that his publications on Data Mining increased at the last four time stamps, i.e., 8, 9, 10, and 11.
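A per-user topic trajectory of the kind plotted in Fig. 9 can be approximated by normalizing topic counts within each snapshot. The sketch below is only an illustration: it assumes each publication has already been assigned a single topic, which simplifies the model's inferred topic distributions, and the function name and sample data are hypothetical. Time stamps with no publications are reported as `None` rather than as zeros.

```python
from collections import Counter


def topic_trajectory(assignments, num_snapshots, topics):
    """Topic proportions of one user at each time stamp.

    assignments: list of (time_stamp, topic) pairs, with time stamps starting at 1.
    Returns a list with one entry per time stamp: a dict of proportions,
    or None when the user has no content at that time stamp.
    """
    counts = [Counter() for _ in range(num_snapshots)]
    for time_stamp, topic in assignments:
        counts[time_stamp - 1][topic] += 1

    trajectory = []
    for snapshot_counts in counts:
        total = sum(snapshot_counts.values())
        if total == 0:
            trajectory.append(None)  # no publications at this time stamp
        else:
            trajectory.append({t: snapshot_counts[t] / total for t in topics})
    return trajectory


topics = ["Data Mining", "Image Processing", "Machine Learning"]
assignments = [
    (1, "Image Processing"), (1, "Machine Learning"),
    (3, "Image Processing"), (3, "Image Processing"),
]
trajectory = topic_trajectory(assignments, num_snapshots=3, topics=topics)
```

Keeping empty time stamps explicit avoids reading a gap in the data as a drop in interest.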

VI. CONCLUSION AND DISCUSSION
First, we investigated and assessed the influence and importance of considering user social behaviors for community detection. Social behaviors reflect users' habits and contribute significantly to the interactions among users (i.e., the topology and content of a network). We proposed four types of user social behaviors (i.e., reciprocity of interactions, posting preference, multitopic preference, and temporal variation of topics). Second, we proposed a novel method (SBCD) that seamlessly combines user social behaviors, network topology, and network content in a generative model. It investigates the formation of a network with complex content to infer community structure and community topics. In this model, network content is divided into node content and link content. Third, we evaluated SBCD on two real networks with ground truth and compared it with six state-of-the-art methods. The experimental results show that SBCD improves the accuracy of community detection by considering social behaviors. Finally, SBCD can also identify topics, topic distributions of communities, user popularity, and individual topic changes over time in each community.
In the future, we first intend to obtain a proper synthetic benchmark to evaluate our community detection model. There are two key requirements for such a benchmark. (1) The social networks in our paper should include textual content on both nodes and links, which means that there should be a word set corresponding to different topics for each node and each link. The words in the set should come from a meaningful real post. The first three behaviors can be satisfied by existing benchmark generation methods. (2) Benchmark generation methods should generate temporal topics of nodes that evolve in a way similar to reality. Second, we intend to investigate how community members and community topics evolve as a result of changes in users' topics.