DCBRTS: A Classification-Summarization Approach for Evolving Tweet Streams in Multiobjective Optimization Framework

Social media platforms like Twitter have become a prominent source of communication during disaster outbreaks. NGOs and government agencies leverage Twitter's open and public features to provide immediate relief. Nevertheless, situational information gets buried in millions of tweets with varying characteristics, and examining each tweet is cumbersome and time-consuming. Thus, efficient extraction of disaster-related tweets, and of the information contained in the extracted tweets, is required. In the current paper, we have developed a novel framework that uses a deep learning-based classification model to separate situational tweets from others and summarizes them in real-time. Our system is a three-phase process: (a) tweet clusters are created from a representative set of tweets, selected from the initial set of extracted tweets using a multi-objective optimization concept; (b) when a new tweet arrives, the clusters are updated. The new tweet is classified as situational or non-situational. If situational, it is assigned to the closest cluster or to a new cluster. This assignment is based on the weighted average of its syntactic and semantic distances and on its relevancy to the cluster; (c) a summary is formulated by extracting tweets from each cluster. The superior performance of the proposed approach on four datasets related to different disaster events indicates the efficiency of the developed framework over state-of-the-art techniques.


I. INTRODUCTION
The increasing popularity of microblogging sites such as Twitter has changed the way people think, live, and communicate [1], [2]. As per a blog 1 in 2013, 400 million tweets were posted per day, and this number had increased to 500 million by 2019 2 . These sites have become a live source of valuable information on ongoing events such as current trends, politics, education, and more. Searching for a topic can return a lot of related tweets, which can be informative but sometimes overwhelming. In case of a disaster event, monitoring informative tweets (also called situational tweets, describing the current status of the affected area, like the number of casualties or important contact numbers such as those of blood banks) may help the disaster management authorities to carry out immediate relief operations. However, two challenges arise when processing incoming tweets for a quick response: (a) the availability of a vast amount of tweets having varied characteristics, including sympathy and emotions, personal opinion, among others. In the literature [2]- [4], the importance of situational tweets has already been shown, and the importance of separating situational tweets from non-situational ones is also established; (b) the rapid rate at which such tweets are posted, which may cause the overload problem.
1 https://blog.twitter.com/2013/celebrating-twitter7
2 https://www.dsayce.com/social-media/tweets-day/
An example illustrating the difference between situational and non-situational tweets is given below. The situational tweet provides some important contact numbers, while the non-situational tweet expresses the sentiment of a person.
Situational tweet: call bsnl numbers 1503, 09412024366 to find out last active location of bsnl mobiles of missing persons in uttarakhand.
Non-situational tweet: Shooting was there at an elementary school. I'm loosing all faith in humanity.
The current paper presents a solution by developing a two-stage approach, namely DCBRTS, for handling continuous tweet streams. In the first stage, the tweet category is identified as either situational or non-situational using a classification framework. After that, an online summarization system is applied to the situational tweets to generate the summary in real-time (RT). Our model uses a deep learning-based classifier built on a convolutional neural network (CNN) [5] that decides whether a tweet is situational or non-situational. The classifier uses the universal sentence (tweet) representation [6] to capture the semantics of these two categories. The summarization model, in turn, resolves the overload problem by summarizing the situational tweets (obtained using the classification model), as going through all such tweets is a cumbersome task for the authorities. As time is critical in a disaster scenario, these tasks are performed in real-time so that the extracted tweets can be made available to the authorities in a timely manner.
It is important to note that developing a real-time tweet summarization (RTS) system is not an easy task, as it has to be efficient (able to handle large tweet streams) and flexible (able to provide summaries at different breakpoints). Most of the existing works [1], [3], [4], [7]- [9] have considered a specific trait while summarizing the tweets. For example, the approach for real-time tweet summarization in the paper [3] focuses on maximizing the number of content words (numerals, nouns, and verbs) using integer linear programming. However, there are different traits, like the maximum length of the tweets [4], the tf-idf score of the tweets [10], and anti-redundancy (to remove redundant tweets from the summary), which can be considered together to obtain a good quality summary. The importance of optimizing all these traits together is shown in the paper by Saini et al. [2]. The paper [2] uses the multi-objective optimization concept to simultaneously optimize the above-stated traits and uses an evolutionary algorithm (inspired by biological phenomena in nature) [11] for the optimization. It selects the optimal subset of tweets and considers them as a summary. In the current study, this approach, MOOTweetSumm, is utilized for selecting the initial set of optimal tweets from a given set of situational tweets.
As our approach targets real-time tweet summarization, the obtained set of optimal tweets is passed to the next phase of our summarization model, clustering, which creates groups of tweets based on their similarity. The continuous tweet stream is assigned to different clusters, taking the selected tweets (obtained using MOOTweetSumm [2]) as the initial cluster centers. Note that (a) for assignment, the weighted average of the syntactic and semantic distances is used; (b) there are thresholds on the maximum number of clusters and on the maximum number of tweets in a cluster, to avoid the problem of information overload; (c) a tweet is assigned to the closest cluster based on its relevancy to that cluster, which is calculated using the cluster size and the distance of the tweet from the cluster center; (d) if an incoming tweet is closest to the ith cluster but cannot be added due to its non-relevancy, then a separate cluster is formed; (e) if there are more than the threshold number of clusters, then two clusters can be merged based on the closeness of their centers and the number of tweets in both clusters. To retain maximum information, some weight is assigned to the number of tweets in both clusters along with the distance between the cluster centers, so that clusters having fewer tweets and maximum similarity are merged.
Finally, at a certain breakpoint provided by the user, the summary is generated. To produce the summary, the clusters present at that timestamp are first ranked based on the scores of the tweets in the respective clusters. The cluster having the highest score is assigned rank-1 (high), and so on. Note that we do not rely on the final tweet representatives, as those keep changing over time, so a different method is used for cluster ranking. To calculate the scores of tweets, we have used four features: three features are the same as those used by MOOTweetSumm [10], and the fourth feature uses the concept of named entity recognition (NER) [12]. The NER tool used identifies organizations, locations, numerals, nouns, verbs, and many more entities. The weighted sum of these features gives the tweet score. We have studied parameter variation over these four features; in the end, two of them are used. Considering the clusters rank-wise from high to low, high-scoring tweets are extracted until the desired length of the summary is reached. Note that the existing works [3], [8], [9] suffer from the drawbacks of using different features for summarization.
Our developed model, DCBRTS, is compared with state-of-the-art techniques. Moreover, we have also developed several baselines to reveal the importance of the choices made at different stages of the proposed approach, such as (a) the utilization of three phases in the developed summarization system; (b) the suitability of different clustering algorithms for grouping the incoming tweets; (c) the suitability of the features or of existing summarization algorithms for generating the summary in the last phase, and so on. For example, in the last phase of real-time tweet summarization, we have explored various existing algorithms like LUHN [13], LexRank [14], TextRank [15], among others. The algorithms used at each phase of the real-time tweet summarization are shown in Figure 1. In total, 30 baselines have been developed.
The major contributions of the current paper are listed below: (1) We propose a classification-cum-real-time tweet summarization system, DCBRTS, to generate summaries at different breakpoints provided by the user; (2) We design a deep learning-based classifier to separate situational tweets from non-situational tweets using their semantics; (3) We model the summarization framework as a three-phase process: selection of an optimal set of tweets using the multi-objective optimization concept, clustering, and finally, summary generation; (4) We have developed various baselines to compare against our system and to perform an in-depth analysis; (5) Extensive experiments on real-time Twitter datasets illustrate promising results.
The proposed framework is tested on four disaster datasets. The results obtained clearly illustrate the superior performance of our real-time tweet summarization framework over state-of-the-art techniques.

II. RELATED WORKS
During any disaster event, situational tweets serve as a useful source of information for the management authorities. However, to be of practical use, these tweets need to be extracted properly and summarized in real-time. Here, we discuss the recent techniques developed for classification and summarization.

A. TWEET'S CLASSIFICATION IN DISASTER EVENT
In the literature [16]- [18], several attempts have been made to separate situational tweets from non-situational ones. These approaches use the bag-of-words model for classification. However, their performance is heavily dependent on the vocabulary of the disaster events; in other words, they use in-domain features extracted from the tweets. To overcome this limitation, the authors of [3] have developed a classification model that uses domain-independent lexical features to distinguish among tweets. Nowadays, researchers are moving towards deep learning-based classification models. Caragea et al. [5] used a convolutional neural network (CNN) [19] for the classification of disaster-related tweets. Still, it suffers from the drawback of representing tweets using the bag-of-words model. Recently, Alrashdi et al. [20] developed a bidirectional long short-term memory (Bi-LSTM) [21] model which uses GloVe 3 word embeddings to represent the tweet.

B. REAL-TIME TWEET SUMMARIZATION
Existing research [4], [10], [22]- [26] considered summarization of an already available set of tweets; in other words, it focused only on developing static summarization techniques. However, what is important during a disaster event is real-time summarization of evolving tweet streams. Some recent approaches are proposed in [1], [3], [7], [27]. In [1], [27], clustering of tweets is performed first, and then representative tweets are selected from each cluster. Finally, ranking of tweets is performed using the LexRank [14] algorithm, which is a graph-based approach. Rudra et al. [3] developed a classification (discussed in the previous section) and summarization model. For the summarization of situational tweets, they have used integer linear programming to maximize the number of content words in the summary. Osborne et al. [28] proposed a real-time event tracking system using greedy summarization. In [7], the authors proposed an abstractive summarization method using a graph-based scheme. In [29], the authors proposed a real-time tweet summarization method which considers three criteria for summary generation, namely novelty, informativeness, and relevance with regard to the user's interest.

C. ADVANTAGE OVER PRIOR STUDIES
The current work has the following advantages over existing works: 1) For the situational tweet classification model, unlike [3] where only syntactic features are used, the current work uses the universal sentence (tweet) vector representation released by Google. 2) For summarizing evolving tweet streams, our approach (a) uses a multi-objective evolutionary algorithm to select the optimal set of tweets from the initial collection of tweets; (b) considers both syntactic and semantic distances while performing clustering, which was absent in [27]; (c) avoids the storage overhead that arises when all tweets are stored from start to end, by putting a threshold on the maximum number of tweets a cluster can retain. This also keeps the information in the clusters up to date as tweets arrive; (d) selects tweets from the final clusters using a weighted sum of various newly developed features (different from [3]).

III. CLASSIFICATION MODEL FOR TWEETS
This section focuses on separating situational tweets from non-situational tweets using a supervised classifier. Training such a classifier requires gold-standard data related to the disaster domain. We have collected annotated data from the two sources described below, which cover a variety of human-made and natural disaster events from all over the world.

A. DATASETS USED FOR CLASSIFICATION
Tweets in the datasets used for classification are divided into several categories and are then labeled as situational or non-situational based on their category. The annotated data collected from the different sources are described below.
1) CrisisLexT26 4 : It includes 1019 unique tweets belonging to different natural disasters like floods, earthquakes, typhoons, and haze, and human-made disasters like terrorist attacks, train crashes, explosions, fires, and more. The category-wise statistics for this dataset are shown in Table 2.
2) CrisisNLP 5 : It includes a set of 15152 unique tweets related to natural disasters like floods, earthquakes, ty-  Table 3.
The label assignment of situational vs. non-situational is also shown. The distributions of situational vs. non-situational tweets are shown in Table 1. There are some issues in these datasets, which are discussed in the subsequent sections and resolved using the majority voting concept. The statistics are shown before and after the application of majority voting, a type of ensembling method in machine learning.

B. ISSUES WITH EXISTING DATASETS AND RESOLUTION
Although the annotators of the above-discussed datasets have annotated each tweet with only one category, a tweet can sometimes be a mixture of both situational and non-situational segments. For instance, the tweet: RT live2Tripoli: 400 people have died in the Balochistan earthquake. May God have mercy on all their souls. #Pakistan #Calamity, is labeled under the injured or missing people category in the CrisisNLP dataset, and hence, it can be considered situational as per the label given. However, the first line (reporting casualties) of the tweet is situational, while the second line (showing sympathy) is non-situational. To counter this problem, we have exploited the majority voting concept and utilized the existing pre-trained SVM classifier [3], trained on four disaster events. In this classifier, the tweet is fragmented and then classified into two categories using lexical features and syntactic features like count of question marks, personal pro-  [3]). The training data used for this SVM classifier consisted of very few tweets. We aim to develop a generic and efficient classifier using deep learning, which requires a large amount of training data. The steps used for majority voting are described below:
1) We have pre-processed and fragmented the tweets of the collected datasets. After fragmentation, each segment was labeled with the class label of the original tweet.
2) Utilizing the existing SVM classifier [3], the fragmented tweets are classified as either situational or non-situational.
3) Finally, if the original labels and the labels generated using the existing SVM classifier are the same, then the tweet is included in the training dataset.
The number of situational and non-situational tweets in each dataset obtained after majority voting is shown in Table 1. The numbers of tweets in the training and testing data, split in a 7:3 ratio, are also shown in the same table.

C. CLASSIFICATION FEATURES AND MODEL
To make our classifier more robust and efficient than the existing classifier [3], we have developed a deep learning-based classifier, i.e., a convolutional neural network [19] with Sigmoid Focal Cross Entropy as the loss function (to mitigate the class imbalance problem), which was trained on the datasets shown in Table 1, containing tweets from 25 disaster events. Non-situational tweets mostly comprise sentiments like grief, sorrow, hatred, and anger. To capture these sentiments, we have considered semantic features using Google's pre-trained universal sentence encoder model [6].
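As an illustration of why a focal loss helps with class imbalance, the following minimal sketch computes the sigmoid focal cross-entropy for a single binary example. It is not the training code of the paper: the function name is hypothetical, and the values `alpha=0.25` and `gamma=2.0` are the commonly used defaults from the focal loss literature, assumed here because the text does not report the settings used.

```python
import math

def sigmoid_focal_cross_entropy(y_true, logit, alpha=0.25, gamma=2.0):
    """Focal loss for one binary example, computed from a raw logit.

    alpha and gamma are assumed defaults, not values taken from the paper.
    """
    p = 1.0 / (1.0 + math.exp(-logit))           # predicted probability of class 1
    p_t = p if y_true == 1 else 1.0 - p           # probability given to the true class
    alpha_t = alpha if y_true == 1 else 1.0 - alpha
    # (1 - p_t) ** gamma shrinks the loss of well-classified (easy) examples
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

easy = sigmoid_focal_cross_entropy(1, 4.0)    # confident and correct
hard = sigmoid_focal_cross_entropy(1, -4.0)   # confident and wrong
```

With `gamma=0` and `alpha=0.5` the expression reduces to half of the ordinary binary cross-entropy, which makes the down-weighting of easy examples the only difference between the two losses.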

D. CLASSIFICATION PERFORMANCE
In Table 4, the results attained by the developed CNN-based classifier are shown in terms of accuracy, precision, recall, and F-measure. Here, we have utilized semantic features (tweet representation using the Universal Sentence Encoder). Our proposed model can perform equally well in cross-domain classification. It can be observed that on the test datasets corresponding to CrisisLexT26 and CrisisNLP, the average F-measure is 87%, which demonstrates its efficacy on the in-domain datasets. The performance of our classifier on the cross-domain datasets is discussed further in Section VI-A.

IV. REAL-TIME TWEET SUMMARIZATION
Let us assume that we have obtained the set of situational tweets between timestamps t1 and t2 (t2 >> t1) after applying the classification module. We treat the summarization of tweets in real-time as a three-phase process. The first phase involves initializing the tweet cluster centers using the initial set of tweets. The second phase occurs when new tweets arrive: they are assigned to the existing clusters, or a new cluster is formed. The third phase starts whenever the user requires a summary, and comprises selecting the optimal tweets for the final summary. The challenges associated with each phase are discussed in detail in subsections IV-A, IV-B, and IV-C. The primary objective throughout is to summarize the tweets in real-time with memory optimization. We have explored several possibilities for each phase, represented in Figure 1 and described in the relevant sections.

A. INITIALIZATION OF TWEET CLUSTERS
In the first phase, the task is to create tweet clusters using an initial stream of situational tweets (let it be t2 − K tweets). The performance of our summarization model predominantly depends on this selection process as the clusters will be input to our next phase to determine the final cluster structure. For this purpose, we have used an existing multi-objective optimization based algorithm, namely, MOOTweetSumm [2], discussed below.

1) MOOTweetSumm
This algorithm was designed on the premise that, while selecting tweets for a summary, a single tweet may be optimal from one perspective but not from others. Therefore, multiple objectives should be considered and optimized simultaneously to select a good set of optimal tweets. Here, we have used two objective functions: the tf-idf score of the tweet (Ob1) and anti-redundancy (Ob2). For optimization, a multi-objective binary differential evolution algorithm (MOBDE) [30] is utilized, which is an evolutionary algorithm [31]. It starts from a set of binary solutions, where the length of each solution equals the number of tweets in the given set of situational tweets, and the number of ones cannot exceed the desired number of optimal tweets. An example of the solution representation is shown in Figure 2, where 1 indicates that the tweet at that index should be in the summary. Each solution is associated with the above two objectives, and MOBDE optimizes these solutions using an iterative procedure. The efficacy of this concept over others is illustrated in the paper [2]. Motivated by this, we have adopted this concept and used it for the selection of the optimal set of tweets.
The selected set of optimal tweets, representing the initial stream of tweets, is then utilized as the set of initial cluster centers. The remaining tweets are assigned to these clusters based on the minimum weighted average distance (cosine distances obtained after tf-idf vectorization and after Universal Sentence Encoder representation) between a tweet and the cluster center (refer to Eq. 3).
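The binary solution encoding and its two objectives can be sketched as follows. This is only an illustration under stated assumptions: the exact definitions of Ob1 and Ob2 are not reproduced in the text above, so Ob1 is taken here as the summed tf-idf weight of the selected tweets and Ob2 as their average pairwise cosine distance. The helper names and the smoothed idf variant are likewise hypothetical choices, and the evolutionary search of MOBDE itself is not shown.

```python
import math
from collections import Counter
from itertools import combinations

def tfidf_vectors(tweets):
    """One {term: weight} dict per tweet, using tf * smoothed idf (an assumed variant)."""
    docs = [Counter(t.lower().split()) for t in tweets]
    n = len(docs)
    df = Counter(term for d in docs for term in d)
    return [{term: tf * (math.log((1 + n) / (1 + df[term])) + 1.0)
             for term, tf in d.items()} for d in docs]

def cosine_distance(u, v):
    """1 - cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)

def is_feasible(solution, n_tweets, max_selected):
    """A solution has one bit per tweet and at most max_selected ones."""
    return len(solution) == n_tweets and sum(solution) <= max_selected

def evaluate(solution, vectors):
    """Return (Ob1, Ob2) for a binary solution; both are to be maximized (assumed forms)."""
    chosen = [vec for bit, vec in zip(solution, vectors) if bit == 1]
    ob1 = sum(sum(vec.values()) for vec in chosen)
    pairs = list(combinations(chosen, 2))
    ob2 = sum(cosine_distance(u, v) for u, v in pairs) / len(pairs) if pairs else 0.0
    return ob1, ob2

tweets = ["flood water rising in city",
          "flood water rising in city",
          "call helpline 1503 for missing persons"]
vectors = tfidf_vectors(tweets)
redundant = evaluate([1, 1, 0], vectors)   # two identical tweets
diverse = evaluate([1, 0, 1], vectors)     # two unrelated tweets
```

In this toy input the redundant selection scores an anti-redundancy close to zero, while the diverse selection scores 1.0, which is the kind of trade-off a multi-objective optimizer explores against the tf-idf objective.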

B. UPDATING CLUSTERS
In the second phase, we update the initial clusters formed whenever a new tweet arrives. The new tweet can be merged into an existing cluster whose center is closest to the tweet, or a new cluster will be formed. The main challenge in this phase is finding if the new tweet is similar enough to be merged with the existing cluster. We have used various heuristic approaches where we define a dynamic threshold. If the distance between the cluster center and the new tweet is greater than the threshold, a new cluster is created.

1) m-BIRCH
We have utilized the existing m-BIRCH (modified Balanced Iterative Reducing and Clustering Using Hierarchies) clustering algorithm, recently developed by Madan et al. [32]. Note that m-BIRCH is an online clustering algorithm that clusters large datasets in an incremental way and is designed to enable data-driven parameter selection and to effectively handle regions of differing density. Here, the initial number of clusters equals the number of optimal tweets selected using MOOTweetSumm, but this number can increase or decrease as the stream of tweets arrives. Let the kth cluster have the tweets {t_1, t_2, ..., t_M}. Then Clus_Size_k, the size of the kth cluster, is computed from the following quantities: M is the number of tweets in the kth cluster, L is the sum of all tweet vectors, s is the sum of squares of all the components of the tweet vectors, and t^v_k is the vector representation of the kth tweet. To capture the syntactic and semantic information present in the tweets, we have used the well-known tf-idf [33] and the recently developed universal sentence encoder [6] vector representations, respectively. Therefore, each cluster has two cluster sizes and two cluster centers, one per representation. For example, for the kth cluster, the cluster sizes are denoted as Clus_Size^1_k (syntactic) and Clus_Size^2_k (semantic), while the cluster centers are denoted as c^1_k and c^2_k. The concept of using two vector representations is akin to multi-view learning, which states that when the same object (tweet) is seen from any angle, it should belong to the same cluster [34]. When a tweet t_j is to be assigned to the kth cluster, we consider the average distance (Eq. 3), where d^1(t_j, c^1_k) is the cosine distance (1-cosine_similarity) between the tweet t_j and the kth cluster center c^1_k in syntactic space. Similarly, d^2(t_j, c^2_k) is computed in semantic space.
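The quantities M, L, and s are exactly what a BIRCH-style clustering feature stores, so the cluster center and size can be maintained incrementally without keeping every tweet vector. The sketch below assumes the standard BIRCH radius (root-mean-square distance of the members from the center) as the cluster size, and a weight `w` for combining the syntactic and semantic distances. Both are assumptions, since the corresponding equations are not reproduced in the text, and the class and function names are hypothetical.

```python
import math

class ClusterFeature:
    """BIRCH-style summary (M, L, s) of the tweet vectors in one cluster."""

    def __init__(self, dim):
        self.M = 0                  # number of tweets
        self.L = [0.0] * dim        # component-wise sum of the tweet vectors
        self.s = 0.0                # sum of squares of all vector components

    def add(self, vec):
        self.M += 1
        self.L = [a + b for a, b in zip(self.L, vec)]
        self.s += sum(x * x for x in vec)

    def center(self):
        return [x / self.M for x in self.L]

    def size(self):
        """Assumed form: standard BIRCH radius, sqrt(s/M - ||L/M||^2)."""
        center_sq = sum(x * x for x in self.center())
        return math.sqrt(max(self.s / self.M - center_sq, 0.0))

def cosine_distance(u, v):
    """1 - cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 if nu == 0.0 or nv == 0.0 else 1.0 - dot / (nu * nv)

def average_distance(d_syntactic, d_semantic, w=0.5):
    """Weighted average of the two views; w is a hypothetical weight."""
    return w * d_syntactic + (1.0 - w) * d_semantic

cf = ClusterFeature(dim=2)
cf.add([0.0, 0.0])
cf.add([2.0, 0.0])
```

One such summary would be kept per view (tf-idf and sentence-encoder vectors), giving the two sizes and two centers per cluster mentioned above.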
When a new tweet, t_m, arrives, its probability of belonging to the situational category is first computed using the developed classification model. If it belongs to this category, then the following steps are executed:
• Find the closest cluster using the shortest average distance criterion (Eq. 3). Let it be the ith cluster.
• If the distance of the tweet from the ith cluster exceeds the limit determined by the size of that cluster and the bounding parameter B, then a new cluster is created; otherwise, the tweet is merged into the same cluster. Here, B is the bounding parameter that controls the merging of incoming tweets into the existing clusters. For example, if a cluster is imagined as a sphere with radius r, then we want the new tweet to lie within the radius r × B.
• If the number of tweets in the ith cluster is greater than a threshold, then only the threshold number of unique tweets that are closest to the center are retained. This threshold is kept to reduce information overload and to store updated information.
• If the number of clusters is greater than a threshold, two clusters are merged at a time until the number of clusters becomes less than the threshold. To determine which clusters should be merged, we define the distance between two clusters as the sum of the weighted distance between their centers and the number of tweets in both clusters divided by the maximum number of possible tweets in a cluster. This is done to merge those clusters which are semantically similar and have fewer tweets.
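The update steps above can be sketched as a single function. This is a simplified, single-view illustration rather than the m-BIRCH implementation: the cluster size is approximated by the mean member-to-center cosine distance, a `min_size` floor (hypothetical) keeps singleton clusters from rejecting every tweet, clusters are merged until no more than `max_clusters` remain, and all parameter values are placeholders.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 if nu == 0.0 or nv == 0.0 else 1.0 - dot / (nu * nv)

def center(cluster):
    return [sum(col) / len(cluster) for col in zip(*cluster)]

def size(cluster):
    """Hypothetical stand-in for Clus_Size: mean member-to-center distance."""
    c = center(cluster)
    return sum(cosine_distance(m, c) for m in cluster) / len(cluster)

def trim(cluster, max_tweets):
    """Keep only the max_tweets unique members closest to the center."""
    c = center(cluster)
    unique = [list(m) for m in dict.fromkeys(tuple(m) for m in cluster)]
    unique.sort(key=lambda m: cosine_distance(m, c))
    cluster[:] = unique[:max_tweets]

def merge_cost(a, b, max_tweets):
    """Center distance plus combined tweet count over the per-cluster cap."""
    return cosine_distance(center(a), center(b)) + (len(a) + len(b)) / max_tweets

def add_tweet(clusters, vec, B=1.5, min_size=0.1, max_tweets=3, max_clusters=2):
    """Assign one situational tweet vector; clusters (lists of vectors) are updated in place."""
    i = min(range(len(clusters)),
            key=lambda k: cosine_distance(vec, center(clusters[k])))
    threshold = B * max(size(clusters[i]), min_size)
    if cosine_distance(vec, center(clusters[i])) > threshold:
        clusters.append([vec])                      # not relevant enough: new cluster
    else:
        clusters[i].append(vec)
        if len(clusters[i]) > max_tweets:
            trim(clusters[i], max_tweets)
    while len(clusters) > max_clusters:             # too many clusters: merge cheapest pair
        pairs = [(a, b) for a in range(len(clusters))
                 for b in range(a + 1, len(clusters))]
        a, b = min(pairs, key=lambda p: merge_cost(clusters[p[0]], clusters[p[1]], max_tweets))
        clusters[a].extend(clusters.pop(b))         # b > a, so index a stays valid
        if len(clusters[a]) > max_tweets:
            trim(clusters[a], max_tweets)
    return clusters

clusters = [[[1.0, 0.0]], [[0.0, 1.0]]]
add_tweet(clusters, [1.0, 0.05])    # close to the first center: merged into it
add_tweet(clusters, [-1.0, 0.0])    # far from both: new cluster, then a merge
```

The merge cost favors pairs whose centers are close and which hold few tweets, mirroring the stated aim of merging similar, sparsely populated clusters.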

C. SUMMARY GENERATION
Whenever a user demands a summary, the tweets in the clusters obtained after the second phase are used for real-time summary generation. To do this, we have to select a set of tweets of varying characteristics that contain most of the information. In other words, the clusters and the tweets in each cluster are first ranked, and then the top-ranked tweets from each cluster are selected in an extractive way, considering the clusters rank-wise.

1) Tweet-ranking
Our developed model is based on extractive summarization [10]; hence, tweets are extracted from each cluster. This requires ranking both the clusters and the tweets within each cluster. For this purpose, we first compute the score of each tweet in a cluster using a weighted sum of four features. The average score of the tweets belonging to a cluster is then taken as the score of that cluster. The higher the score, the higher the rank (rank-1 is considered the highest). Let the kth cluster have M tweets, {t_1, t_2, ..., t_M}. The tweet-scoring features for a tweet t_l are described below.
1) Anti-redundancy (F1^k_{t_l}): It is used to remove redundancy in a summary. A tweet in the cluster should be diverse from the others in the same cluster [2]. The feature is computed from D, the average distance between two tweets in syntactic and semantic space, as described in Eq. (3), and the total number of tweet pairs in the same cluster.
2) MaxSumTFIDF (F2^k_{t_l}): A tweet's score depends highly on the relevance of its words [2]. Therefore, we have computed the sum of the tf-idf scores of the different words in the tweet, which is used as the tweet score.
3) MaxLength (F3^k_{t_l}): In the literature [2], [4], tweets of maximum length are shown to be relevant in summary generation. Therefore, this feature is taken into account.
4) CountNamedEntities (F4^k_{t_l}): In a disaster event, named entity recognition plays a major role [9], as it identifies locations, organizations, numerals, and many more. Therefore, we have counted the number of named entities present in the tweet and divided it by the total number of named entities present in the tweet data to normalize it. Mathematically, F4^k_{t_l} = (number of named entities in t_l) / Q, where Q is the number of named entities present in the tweet data.
Thus, the final score of tweet t_l in the kth cluster is the weighted sum of the four features, where α, β, γ, and λ are the weight factors assigned to the different features. Note that for the features F2, F3, and F4, high scores are desired, while for F1, a low score is desired. Therefore, to make the weighted sum higher, the value of the feature F1 is reversed. After evaluating the scores of the tweets, high-scoring tweets are extracted considering the clusters rank-wise, until we get the desired number of tweets in the summary.
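The scoring and extraction procedure can be sketched as below, assuming the four feature values are already computed and normalized to [0, 1]. Several details are assumptions made for illustration: the pairing of weights with features, "reversing" F1 as 1 - F1, equal default weights, and a round-robin extraction that takes the best remaining tweet from each cluster in rank order. The function names are hypothetical.

```python
def tweet_score(f1, f2, f3, f4, alpha=0.25, beta=0.25, gamma=0.25, lam=0.25):
    """Weighted sum of the four features; F1 is reversed as (1 - f1) by assumption."""
    return alpha * (1.0 - f1) + beta * f2 + gamma * f3 + lam * f4

def summarize(clusters, length):
    """Pick `length` tweets from non-empty clusters of (text, score) pairs."""
    # Rank clusters by the average score of their tweets, highest first.
    ranked = sorted(clusters, key=lambda c: sum(s for _, s in c) / len(c), reverse=True)
    # Rank tweets inside every cluster, highest score first.
    ranked = [sorted(c, key=lambda ts: ts[1], reverse=True) for c in ranked]
    summary, depth = [], 0
    while len(summary) < length and any(depth < len(c) for c in ranked):
        for c in ranked:                      # visit clusters in rank order
            if depth < len(c) and len(summary) < length:
                summary.append(c[depth][0])
        depth += 1
    return summary

clusters = [[("a", 0.2), ("b", 0.4)], [("c", 0.9), ("d", 0.7)]]
picked = summarize(clusters, 3)
```

Taking tweets round-robin keeps every cluster represented in short summaries; exhausting the top-ranked cluster first would be an equally plausible reading of the text.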

V. EXPERIMENTAL SETUP
In this section, we discuss the datasets, experimental settings, and evaluation measures, followed by the comparative methods.

A. DATASETS
For the purpose of experimentation, we have used four disaster events, including natural and human-made disasters that occurred in different regions of the world. Each dataset is available as a continuous stream of 5000 tweets with other information like time and date. These datasets are briefly described below: abad city of India. The same datasets 6 are used by the paper [3]. As these datasets are designed for real-time tweet summarization, gold summaries are provided at two breakpoints of 2000 and 5000 tweets. To identify the named entities present in the tweets, we have used the spaCy 9 package of Python, designed for different natural language processing tasks. Initially, we assume that we have a set of 600 tweets, and then tweets keep arriving one by one.
For the rest of the parameters, such as the maximum number of representatives selected using the MOO-based approach, the maximum number of clusters and of tweets in a cluster, the weight factors assigned to the different features used in calculating the tweet score, and the bounding factor used in clustering, the best values are selected after performing an ablation study, as reported in Section VI-C.

C. COMPARATIVE METHODS
For comparison purposes, we have considered the COWTS [3] approach for summarizing disaster-specific events in real-time. COWTS focuses on extracting tweets having the maximum number of content words (nouns, numerals, and verbs). Note that this approach also classifies tweets as situational or non-situational and then summarizes the situational tweets. However, in comparison to ours, it is not efficient in terms of memory optimization, as it does not discard any situational tweets. In addition to COWTS, we have developed several baselines of our proposed approach, DCBRTS, by varying the methodologies used in its different stages/phases, to demonstrate its efficacy. The possible baselines are graphically shown in Figure 1 and discussed below:
6 http://cse.iitkgp.ac.in/ krudra/disaster_dataset.html.
7 https://www.tensorflow.org/guide/keras
8 https://github.com/nsaini1988/Microblog_Summarization
9 https://spacy.io/usage/linguistic-features
• Initialization of tweet clusters (Phase-1): As discussed in Section IV-A, the initial stream of tweets is clustered. We have used two well-known clustering algorithms for the same.
-DBSCAN 10 : Density-based clustering is one of the most commonly used non-parametric algorithms. Given a set of points in some space, it groups together points that are closely packed, and identifies points lying alone in low-density regions as outliers. DBSCAN requires a parameter eps, which was set to 0.2, and min samples, which was set to 5.
-Hierarchical Clustering 11 : It is a cluster analysis method that seeks to build a hierarchy of clusters. We have used the agglomerative bottom-up approach, where each observation initially forms its own cluster. Then, moving up the hierarchy, pairs of clusters are merged. It requires the number of clusters to be created as a parameter, which was set to the number of clusters initially created by DBSCAN.
• Updating Clusters (Phase-2): We have devised another approach to determine whether a new tweet should be merged into an existing cluster or a new cluster is to be formed. Compared to m-BIRCH, the difference lies in the determination of the cluster size. Here, we have defined it as the average of the cosine distances from the closest cluster center to the other cluster centers. Let the cluster centers be {t_1, t_2, ..., t_M}, with M the total number of clusters. Then Clus_Size_k, the size of the kth cluster, is computed from d_{i,k}, the cosine distance (1-cosine_similarity) between the cluster centers t_i and t_k in syntactic space. The corresponding distance is computed similarly in semantic space.
• Summary Generation (Phase-3): We employed existing summarization approaches, where all the tweets in the clusters are passed as input to the approach. The approaches used are presented below:
-LexRank 12 : LexRank is a commonly used unsupervised graph-based approach for automatic text summarization, where a graph method is exploited to score sentences.
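The alternative cluster size used in the Phase-2 baseline above can be sketched in a few lines. The normalization by the number of other centers (M - 1) is an assumption, since the equation is not reproduced in the text, and the function names are hypothetical. Only one view is shown; the same computation would be repeated on the semantic vectors.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 if nu == 0.0 or nv == 0.0 else 1.0 - dot / (nu * nv)

def center_based_size(k, centers):
    """Average cosine distance from center k to every other center (needs >= 2 centers)."""
    others = [cosine_distance(centers[k], c) for i, c in enumerate(centers) if i != k]
    return sum(others) / len(others)

centers = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
size_0 = center_based_size(0, centers)
```

Unlike the member-based size, this quantity depends only on how the centers are spread, so a cluster surrounded by distant neighbors accepts tweets from a wider region.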

D. EVALUATION MEASURE
To evaluate the quality of the summary, we have used the well-known ROUGE family of metrics, in which ROUGE-N counts the overlapping N-grams between the predicted and the actual summary. More specifically, we used the ROUGE-L F-score, which is a Longest Common Subsequence (LCS) based statistic. Note that the baseline papers reported the ROUGE-1 F-score as the evaluation measure. However, ROUGE-1 only scores whether single words overlap between the predicted output and the ground truth. As all tweets revolve around a particular topic, this is a relatively easy objective to satisfy. Hence, we have utilized the ROUGE-L F-score.
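The difference between the two measures can be seen in a small sketch. It assumes whitespace tokenization, lowercasing, and a balanced F-score (equal weight on precision and recall); the function names are illustrative, and standard ROUGE toolkits may apply additional preprocessing such as stemming.

```python
from collections import Counter


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def f_score(matches, cand_len, ref_len):
    """Balanced F-score from a match count and the two lengths."""
    if matches == 0 or cand_len == 0 or ref_len == 0:
        return 0.0
    precision = matches / cand_len
    recall = matches / ref_len
    return 2 * precision * recall / (precision + recall)


def rouge_1_f(candidate, reference):
    """ROUGE-1 F-score: clipped unigram overlap, order ignored."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    overlap = sum((Counter(cand) & Counter(ref)).values())
    return f_score(overlap, len(cand), len(ref))


def rouge_l_f(candidate, reference):
    """ROUGE-L F-score: based on the longest common subsequence."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    return f_score(lcs_length(cand, ref), len(cand), len(ref))


# Same words, different order: ROUGE-1 cannot tell the two apart.
reference = "blast kills ten in city market"
candidate = "ten kills blast in market city"
print("ROUGE-1 F:", rouge_1_f(candidate, reference))
print("ROUGE-L F:", rouge_l_f(candidate, reference))
```

In this toy pair, every word of the candidate appears in the reference, so the unigram measure is saturated, while the LCS-based measure penalizes the scrambled word order.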

VI. EXPERIMENTAL RESULTS
This section describes the results of our three-phase dynamic summarization approach on the datasets discussed in Section V-A. Note that the authors of the COWTS method have also developed a classification-cum-summarization technique; therefore, we have executed the code of COWTS to obtain the results.

A. CLASSIFICATION RESULTS
To evaluate the performance of the developed CNN-based classification model, we have selected the cross-domain datasets discussed in Section V-A. Note that these datasets are not used as part of the training data. As can be seen from Table 5, our classifier also performs well on the cross-domain datasets because the training dataset consists of information related to various natural and man-made disasters.
On the other hand, the existing SVM classifier [3], which utilizes lexical and syntactic features, was reported in [3] to achieve an average accuracy of 79.5% over four datasets, which is lower than that of our classifier. This demonstrates the efficacy of our deep learning-based classifier over existing classifiers.

B. REAL-TIME SUMMARIZATION RESULTS
In Table 6, our proposed algorithm for real-time tweet summarization is compared with COWTS at two breakpoints of 2000 and 5000 tweets. Since the baseline paper COWTS reported only the ROUGE-1 F-score, we executed the code of COWTS again to obtain its ROUGE-L F-score. All algorithms are unsupervised in nature. It is evident that our approach performs better than the existing one. For instance, considering the mean ROUGE-L F-score over all datasets, our method improves by 4% over COWTS. The higher scores attained by our approach indicate that our three-phase dynamic summarization system, i.e., selection of representative tweets using multi-objective optimization, the m-BIRCH online clustering algorithm, and summary generation using various features, together with the CNN-based situational tweet selection approach, is better suited for generating real-time summaries.
We have presented detailed results over the individual datasets at both the 2000 and 5000 breakpoints, obtained using different variants of our proposed approach, DCBRTS, in Tables 7 to 12. Note that our proposed model is a three-phase model in which we have explored 3, 2, and 5 possibilities (changing the working scenario) in the first, second, and third phases, respectively. Each table shows the ROUGE-L F1-score obtained for one combination of possibilities from the first and second phases. The average results over all datasets are reported in Table 13.
To illustrate the nature of the summary, we have shown an example of a generated summary in Table 14 for the HBlast dataset at a breakpoint of 2000 tweets, in comparison with the corresponding gold summary. The matched lines are shown in the same colours (excluding black). We have only highlighted complete tweets that occur in both summaries. This generated summary has a ROUGE-L F-score of 0.56.
To check the cluster quality at different breakpoints, we have plotted the average compactness of the clusters at both breakpoints. Note that we have utilized both the universal encoder (semantic) and tf-idf (syntactic) representations while calculating the distance (refer to Eq. 3). Each cluster is expected to hold at most a threshold number of tweets, to avoid information overload due to continuous tweet streaming, and the cosine distance takes values between 0 and 2; therefore, the maximum compactness value for a cluster is expected to be threshold * 2. The average compactness of the clusters formed at the 2000 and 5000 breakpoints for all datasets is shown in Figure 5. In this figure, parts (a) and (b) illustrate the average compactness using the semantic and syntactic representations, respectively. The values shown suggest that the clusters created are compact and of good quality.
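The bound of threshold * 2 can be checked with a short sketch. The compactness formula is not spelled out in this section, so the code assumes it is the sum of cosine distances from each tweet vector to the cluster centroid, which is consistent with the stated upper bound; the function names and the toy vectors are illustrative only.

```python
import math

MAX_TWEETS_PER_CLUSTER = 40  # threshold value selected in the sensitivity analysis


def cosine_distance(u, v):
    """Cosine distance (1 - cosine_similarity), clamped to [0, 2]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # assumption: a zero vector is treated as orthogonal
    return min(2.0, max(0.0, 1.0 - dot / (norm_u * norm_v)))


def centroid(vectors):
    """Component-wise mean of the member vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def compactness(vectors):
    """Assumed definition: sum of member-to-centroid cosine distances."""
    center = centroid(vectors)
    return sum(cosine_distance(v, center) for v in vectors)


# Toy cluster in one representation (semantic or syntactic).
cluster = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2]]
value = compactness(cluster)
upper_bound = 2 * MAX_TWEETS_PER_CLUSTER
print(f"compactness = {value:.4f}, upper bound = {upper_bound}")
```

Lower values indicate tighter clusters; the same computation would be run once per representation to obtain the two panels of the figure.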

C. SENSITIVITY ANALYSIS
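In this section, we present a sensitivity analysis of various parameters: the bounding factor, the maximum number of clusters, the maximum number of tweets in a cluster, the number of tweets used for creating the initial clusters, and the weights of the tweet-scoring features.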

1) Effect of bounding factor (B):
We have used B as a bounding factor while updating the clusters. If B is small, the probability of merging a new tweet into an existing cluster decreases, resulting in the formation of new clusters. If a large number of new clusters is created, merging them becomes computationally expensive. On the other hand, if B is large, most of the tweets are absorbed by the existing clusters, which affects cluster quality and compactness. Table 15 shows the number of times two clusters are merged into one cluster (cnt_merges) and the number of times new tweets are added to an existing cluster (cnt_additions). Here, B=0.6 is shown to provide a good balance.
2) Effect of $N_{max}$:
Figure 6 depicts the change in ROUGE-L F-score with the maximum number of clusters, $N_{max}$. When $N_{max}$ is small, the ROUGE-L F-score is low due to a substantial loss of information. When $N_{max}$ is too large, clustering becomes slow due to the large number of clusters, and the storage overhead is also higher. A balanced value for $N_{max}$ is 40, which is used for generating the results.
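The interplay between the bounding factor and the cap on the number of clusters can be sketched with a toy streaming loop. This is not the m-BIRCH procedure itself: the absorb rule (`distance <= B`), the use of centroids, and the "merge the two closest clusters when the cap is exceeded" step are simplifying assumptions, and `update_clusters` is a hypothetical name.

```python
import math


def cosine_distance(u, v):
    """Cosine distance (1 - cosine_similarity) between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # assumption: a zero vector is treated as orthogonal
    return 1.0 - dot / (norm_u * norm_v)


def mean(vectors):
    """Component-wise mean of a list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def update_clusters(stream, B, n_max):
    """Toy online clustering that counts additions and merges."""
    clusters = []  # each cluster is a list of member vectors
    cnt_additions = 0
    cnt_merges = 0
    for vec in stream:
        centers = [mean(c) for c in clusters]
        if centers:
            k = min(range(len(centers)),
                    key=lambda i: cosine_distance(vec, centers[i]))
            if cosine_distance(vec, centers[k]) <= B:  # assumed absorb rule
                clusters[k].append(vec)
                cnt_additions += 1
                continue
        clusters.append([vec])  # open a new cluster for this tweet
        if len(clusters) > n_max:
            # Assumed policy: merge the two clusters with the closest centers.
            centers = [mean(c) for c in clusters]
            pairs = [(cosine_distance(centers[i], centers[j]), i, j)
                     for i in range(len(centers))
                     for j in range(i + 1, len(centers))]
            _, i, j = min(pairs)
            clusters[i].extend(clusters[j])
            del clusters[j]
            cnt_merges += 1
    return clusters, cnt_additions, cnt_merges


stream = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [1.0, 1.0]]
for B in (0.001, 0.6):
    clusters, additions, merges = update_clusters(stream, B=B, n_max=2)
    print(f"B={B}: clusters={len(clusters)}, "
          f"cnt_additions={additions}, cnt_merges={merges}")
```

With a very small B, nearly every tweet opens a new cluster and the cap forces repeated merges; with a larger B, tweets are absorbed directly, which mirrors the trade-off between cnt_merges and cnt_additions discussed for the bounding factor.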

3) Effect of maximum number of tweets in a cluster:
As the maximum number of tweets in a cluster increases, more information is stored in a cluster and the summary quality improves. However, a very large value disperses the information across many tweets, some of which can be missed by the summarization algorithm. It also increases the computation and storage overhead. As shown in Figure 3, 40 is selected as the optimal value of this parameter.

4) Effect of number of tweets used for creating clusters:
The initial number of tweets used for creating clusters determines the quality of the clusters used for further stream processing. Although the summary quality remains almost the same as this number changes, as shown in Figure 4, we selected 600 tweets for initial clustering to ensure that a good initial partitioning is created.

5) Ablation study for tweet-scoring features:
An ablation study for the weight factors (α, β, γ, and λ) assigned to the various tweet-scoring features (Anti-redundancy, MaxSumTFIDF, MaxLength, and CountNamedEntities) is shown in Table 16. From this table, it is evident that the MaxSumTFIDF and CountNamedEntities features, both having an equal weight of 0.5, helped in increasing the summary quality as the number of newly arriving tweets increases. Hence, these features with equal weights are considered for summary generation in the reported results.
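How the weight factors interact with the four features can be sketched as follows. The linear combination, the assumption that each feature is already normalized to [0, 1], and the feature values themselves are illustrative; only the feature names and the selected weights (β = λ = 0.5) come from the text above.

```python
from dataclasses import dataclass


@dataclass
class TweetFeatures:
    """Per-tweet feature values, assumed normalized to [0, 1]."""
    text: str
    anti_redundancy: float
    max_sum_tfidf: float
    max_length: float
    count_named_entities: float


def score(tweet, alpha=0.0, beta=0.5, gamma=0.0, lam=0.5):
    """Assumed linear combination of the four tweet-scoring features."""
    return (alpha * tweet.anti_redundancy
            + beta * tweet.max_sum_tfidf
            + gamma * tweet.max_length
            + lam * tweet.count_named_entities)


def pick_representative(cluster, **weights):
    """Return the highest-scoring tweet of a cluster."""
    return max(cluster, key=lambda t: score(t, **weights))


# Made-up feature values for two tweets in the same cluster.
cluster = [
    TweetFeatures("long but generic tweet", 0.9, 0.2, 0.9, 0.1),
    TweetFeatures("short tweet naming places and numbers", 0.3, 0.8, 0.4, 0.6),
]

best = pick_representative(cluster)  # beta = lam = 0.5
print("Selected with tf-idf + named entities:", best.text)

ablated = pick_representative(cluster, alpha=1.0, beta=0.0, gamma=0.0, lam=0.0)
print("Selected with anti-redundancy only:", ablated.text)
```

Setting a weight to zero removes the corresponding feature, which is the kind of configuration an ablation table over (α, β, γ, λ) would enumerate.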

VII. CONCLUSION
The current article presents a novel framework for classification followed by summarization to handle the continuous tweet streams posted during disaster events. This system can help disaster management authorities carry out immediate relief operations. For classification, a deep learning-based classifier is proposed, which identifies the situational tweets using semantic and syntactic features. The concept of ensemble learning (majority voting) is also utilized, which takes the help of existing classifiers to design such a classifier. The identified situational tweets are then used as inputs to the real-time summarization system. We have derived various key insights from the developed summarization framework: (a) selecting representative tweets from the initial set of situational tweets using a multi-objective evolutionary algorithm helps in providing the right direction for optimal summary formulation; (b) the online clustering algorithm helps in clustering the incoming tweets, and putting a threshold on the maximum number of clusters and on the maximum number of tweets in a cluster helps in minimizing information overload; (c) the tf-idf score and the count of named entities together help in generating a better summary than COWTS and Sumblr. In terms of improvement, considering the mean ROUGE-L F-score over all datasets, our method improves by 4% on average over COWTS.
In the future, we would like to extend the work to a sentiment-aware real-time microblog classification-summarization framework and to apply it to multiple regional languages so that it can be beneficial to the com-