Exploring Pattern Mining Algorithms for Hashtag Retrieval Problem

The hashtag is an iconic feature for retrieving hot topics of discussion on Twitter and other social networks. This paper incorporates pattern mining approaches to improve the accuracy of retrieving relevant information and to speed up the search. A novel algorithm called PM-HR (Pattern Mining for Hashtag Retrieval) first transforms the set of tweets into a transactional database using one of two strategies (trivial and temporal). The set of relevant patterns is then discovered and used as a knowledge base for finding the relevant tweets for users' queries through a similarity search process. Extensive experiments are carried out on large and varied tweet collections; the proposed PM-HR outperforms the baseline hashtag retrieval approaches in terms of runtime and is very competitive in terms of accuracy.


I. INTRODUCTION
A hashtag is a type of metadata tag widely used on social networks, e.g., Twitter or Facebook. Hashtags allow users to easily find messages within a specific topic, without employing any markup language or formal taxonomy [1]. HR (Hashtag Retrieval) aims at finding the relevant hashtags for a given query from a corpus of tweets, and can be considered one of the major information retrieval problems [2], whose variants for social networks have grown rapidly in recent decades. Several HR algorithms [3]-[5] have been developed, and most of them need to scan all the tweets to determine the similarity of each hashtag to the user's query. A ranking function is provided to compute and derive the most relevant hashtags, which takes polynomial time O(m × n) for m published tweets and n target hashtags. However, the accuracy is often reduced when dealing with a large corpus of tweets. This is because these approaches ignore the correlation among the set of hashtags and use traditional search strategies for finding the relevant hashtags. The core of these strategies consists of scanning the set of all tweets and calculating the similarity between each tweet and the user's query. This process is prohibitive for large collections with large numbers of tweets and hashtags.

The associate editor coordinating the review of this manuscript and approving it for publication was Keli Xiao.

A. MOTIVATION
Data mining is used to solve a variety of real-world problems, such as Business Intelligence [6], [7], Ontology Matching [8], Constraint Programming [9], [10], and Information Retrieval [11], [12]. The data mining techniques used in information retrieval aim at discovering knowledge from a collection of documents according to a user's query. For example, classification, clustering, and frequent itemset/association-rule mining classify new documents, partition documents into similar groups, and discover frequent terms from the collection of documents, respectively. Although those approaches achieve better runtime performance, they only address information retrieval for documents. To the best of our knowledge, no existing work explores pattern mining for solving the HR problem. Only two works exploring pattern mining for Twitter and microblogging analysis have been proposed [13], [14]. The first proposed a pattern mining approach for retrieving information from microblogging datasets. However, this solution is limited to the microblogging environment. The second proposed a pattern mining solution for retrieving information from tweet collections. However, this work retrieves any kind of information and does not study the correlation among the hashtags of the tweets. The main motivation of this research study is the success of existing pattern mining algorithms in improving the performance of document information retrieval. Therefore, to address the limitations of the existing HR solutions, this paper proposes a new framework named PM-HR (Pattern Mining for Hashtag Retrieval), which investigates several pattern mining problems in hashtag retrieval.

B. CONTRIBUTIONS
The major contributions of this paper are threefold:
• The corpus of tweets is first transformed into a transactional dataset by developing two strategies based on trivial and temporal transformations.
• Several pattern mining algorithms are incorporated for solving the HR problem, namely frequent, closed, and maximal itemset mining, as well as high utility and high average utility itemset mining.
• Experimental validation on large corpora of tweets reveals that PM-HR outperforms the baseline HR approaches in terms of runtime and is very competitive in terms of accuracy.

C. OUTLINE
The remainder of the paper is organized as follows. Section II reviews the main HR approaches. Section III formulates the hashtag retrieval problem. Section IV explains the overall design of the PM-HR framework. Section V presents the experimental evaluation. Finally, Section VI draws the conclusions and discusses opportunities for future work.

II. LITERATURE REVIEW
Hashtag analysis is a hot topic in the data mining and machine learning communities. In the last decade, many applications have been proposed, such as hashtag recommendation [15], [16], hashtag-based story detection [17], [18], and microblogging retrieval [13], [19], [20]. In this research study, we focus on the hashtag retrieval problem. This section reviews the works proposed to date for solving the HR problem [21]-[24].

A. HASHTAG RETRIEVAL
Efron [3] discusses the dynamics of the tweeting process by developing a language modeling approach to retrieve relevant hashtags, which can be used to improve search performance on Twitter. It designs a new query expansion strategy called HFB (Hashtag FeedBack query model) to define relationships between different hashtags published in a microblogging environment. Efron then developed a new method [4] to generate multiple microblog posts that are relevant to a given query. Although these two approaches can retrieve the relevant posts, they do not provide a good mechanism to define relationships among varied hashtags; only the unique characteristics of microblogs are considered. This consequently reduces the overall performance of the hashtag retrieval process. Li et al. [5] proposed a machine learning framework to discover the relevant hashtags in the health domain. It uses a deep learning approach to classify the tweets by the distributed representations of the words, which aims at optimizing an objective function for the likelihood of word occurrences. It first performs pre-processing to clean tweets by removing URLs (Uniform Resource Locators), unifying dates, and removing special characters except the # character. The cosine similarity score is then computed between all hashtags and the health keywords query. The hashtags are finally ranked according to the similarity scores, and the most relevant hashtags are returned to the user. This strategy is also widely used in NLP (Natural Language Processing) [25]. Wang et al. [26] proposed LBP (Loopy Belief Propagation) for sentiment analysis in hashtag retrieval. A new graph representation describes the features related to unigrams, punctuation, sentiment lexicon, and polarity classification of text. This graph representation can be used to define co-occurrence and the literal meaning of hashtags. Luo et al. [27] investigated a data-driven approach to enhance hashtag retrieval. Structural information can be used as features for finding relevant hashtags in the ad hoc scenario. Experimental results showed that this ranking approach achieved high accuracy against existing methods. Tariq et al. [28] use a discriminative term-weight approach to derive relationships between the set of topics and terms. The discriminative weights are first assigned to terms, and the input feature space is then transformed to discriminative information spaces using an opinion pooling technique [29]. A learning model is then applied to these spaces to find suitable hashtags for the given user. Bansal et al. [30] proposed a semantic hashtag retrieval approach to improve the accuracy of the resulting hashtags. The set of hashtags is first segmented, and each group is linked to Wikipedia to enrich the semantic search. This approach requires high computation cost and memory usage for the segmentation and semantic search process. Only two works exploring pattern mining for Twitter and microblogging analysis have been proposed [13], [14]. The first one [13] proposed a pattern mining approach for retrieving information from microblogging datasets. However, this solution is limited to the microblogging environment. The second work [14] proposed a pattern mining solution for retrieving information from tweet collections. However, this work retrieves any kind of information and does not study the correlation among the hashtags of the tweets.

B. DATA MINING TECHNIQUES
Beil et al. [31] developed HFTC (Hierarchical Frequent Term-based Clustering), which applies the association-rule discovery process to information retrieval. It first extracts the frequent itemsets and then models the discovered information by the terms of the document collection. The most frequent itemsets are considered as clusters, and each frequent itemset is treated as one cluster containing the relevant documents. Fung et al. [32] proposed FIHC (Frequent Itemset-based Hierarchical Clustering), which uses the frequent itemsets to construct a hierarchical tree. The tree then represents the collection of documents. The experiments reveal that the execution time of the user's requests can be greatly reduced. Yu et al. [33] presented a new algorithm called TDC (Transaction Decomposing Clustering) to improve the quality of the classification of the documents. It dynamically generates the different topics of the collected documents using only the closed frequent itemsets. This approach can reduce the execution time compared with the FIHC algorithm. TDC uses an intelligent structure that hierarchically links each k-itemset with its (k-1)-itemsets. This approach shows high precision, but clusters can overlap when the terms of the documents are highly linked. Babashzadeh et al. [34] proposed a new algorithm, ARMIR, for text processing. In this approach, a given request is modeled by a set of concepts, where the relations between concepts of the same request are determined by an association-rule mining process. Zhong et al. [35] proposed the PTM (Pattern Text Mining) algorithm to improve the comprehension of the user's request using a pattern mining algorithm. The taxonomy of the patterns is discovered by applying a closed-pattern-based algorithm to the training set. This technique reduces the noise between the user's request and the set of collected documents. Djenouri et al. [12] proposed BSOGDM-IR (Bees Swarm Optimization Guided by Data Mining for Documents Information Retrieval), using a computational intelligence approach (i.e., BSO) and data mining techniques to improve the runtime performance. The collected documents are first grouped into several clusters using the k-means algorithm. Frequent itemset mining is then applied to each cluster to extract the relevant terms. Each bee in BSO explores the obtained clusters to find the relevant documents, guided by the knowledge extracted in the previous steps. Djenouri et al. [11] developed ICIR (Intelligent Cluster-based Information Retrieval) to investigate frequent closed itemset mining in cluster-based information retrieval, finding the closed frequent terms in each cluster. Four alternative heuristics are then suggested to select the most relevant clusters. Zingla et al. [36] proposed the HQE (Hybrid Query Expansion) approach, which combines external hashtag resources and association-rule mining for retrieving the most relevant texts from microblogs. Association-rule extraction is first applied to the microblogging text collection to generate the candidates. The original query is then transformed into the candidates using an external knowledge source. The relatedness between the query and the set of candidates is finally determined using an explicit semantic analysis measure.

C. DISCUSSIONS
Based on the aforementioned works, we can conclude that (1) most hashtag retrieval solutions apply classical information retrieval techniques, and (2) only classical data mining techniques such as clustering, frequent itemset mining, and association-rule mining are considered for retrieving the relevant documents. Thus, the accuracy of hashtag retrieval is reduced, in particular when the published tweets are spread over different domains. New methods are needed to address the limitations of hashtag retrieval approaches. Therefore, we propose a hybrid framework that integrates pattern mining to retrieve the relevant hashtags. Table 1 provides a classification of the existing information retrieval solutions and their limitations.

III. PROBLEM FORMULATION
To define the hashtag retrieval problem, we need a few preliminary definitions. We consider a collection of tweets, where each tweet is represented by a subset of hashtags. We are also given a set of user queries. The goal of the hashtag retrieval problem is to respond to and satisfy the user queries.

Definition 1 (Hashtag Retrieval Problem): Let $\Lambda = \{\lambda_1, \ldots, \lambda_m\}$ be a collection of tweets over the set of hashtags $H$, and let $Q$ be a set of user queries, where each query $q_i$ is composed of the set of terms $\{t_1, t_2, \ldots, t_r\}$. The HR problem aims at finding, for each query $q_i \in Q$, the most relevant subset of tweets $\Lambda' \subset \Lambda$. We also provide a ranking function, which maps each tweet to a score value for a given user query.

Definition 2 (Ranking Function): Let us consider a function $f : \Lambda \times Q \rightarrow \mathbb{R}^{+}$ that determines the score of each tweet $\lambda_i \in \Lambda$ according to a given query $q_j \in Q$; we denote the result by $f(\lambda_i, q_j)$. The ranking function Rank aims to rank the scores of the tweets for each given query q.

Given Definitions 1 and 2, the top-k HR problem aims at finding, for each query $q_i$, a subset of tweets $\Lambda' \subset \Lambda$ such that $|\Lambda'| = k$ and $f(\lambda, q_i) \geq f(\lambda', q_i)$ for every $\lambda \in \Lambda'$ and every $\lambda' \in \Lambda \setminus \Lambda'$. Consider the example of published tweets given in Table 2. Note that # is the starting symbol of each hashtag. For the query q: Italy, the top-1 HR problem returns $\Lambda' = \{\lambda_4\}$. Since solutions to the HR problem rely on a similarity search that requires $O(|\Lambda| \times |H| \times |Q|)$ operations, they are highly time consuming in real-world scenarios. For instance, for the Football corpus containing 3,000,000 tweets and 90,660 hashtags, and for 1,000,000 user queries, the number of possible matchings is about $27 \times 10^{16}$, which is prohibitive even for existing supercomputers in online query processing. To deal with this challenging issue, the next section presents a new model for solving the top-k HR problem more efficiently.
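The exhaustive search described above can be illustrated with a minimal Python sketch. The scoring function `overlap_score` and the toy corpus are assumptions made purely for illustration (they are not the ranking function or Table 2 of the paper); the last line reproduces the operation count quoted for the Football corpus.

```python
import heapq

def overlap_score(tweet_hashtags, query_terms):
    """Hypothetical ranking function f: number of query terms among the tweet's hashtags."""
    return float(len(set(tweet_hashtags) & set(query_terms)))

def top_k(tweets, query_terms, k, score=overlap_score):
    """Return the ids of the k highest-scoring tweets by scanning every tweet once."""
    scored = ((score(tags, query_terms), tid) for tid, tags in tweets.items())
    # nlargest keeps only k candidates in memory; equal scores fall back to the larger id.
    return [tid for _, tid in heapq.nlargest(k, scored)]

# Toy corpus invented for illustration (not the content of Table 2).
tweets = {
    1: {"football", "worldcup"},
    2: {"pizza", "food"},
    3: {"football", "spain"},
    4: {"italy", "football"},
}
best = top_k(tweets, {"italy"}, k=1)

# Tweets x hashtags x queries for the Football example in the text.
comparisons = 3_000_000 * 90_660 * 1_000_000
```

With these numbers, `comparisons` is 271,980 × 10^12, i.e. roughly 27 × 10^16, which is why scanning the full collection for every query does not scale.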

IV. GENERAL FRAMEWORK
This section presents the proposed PM-HR approach, which employs pattern mining to improve the quality of the hashtag retrieval process. The designed approach consists of three main steps: i) a transformation step, which translates the set of hashtags into a transactional database; ii) a mining step, which extracts the relevant patterns from the transactional database created in the previous step; and iii) a searching step, which finds the relevant tweets for the users' queries using the discovered patterns. A knowledge base is built from the relevant patterns to handle a large number of user queries in real time. Figure 1 gives an overview of the PM-HR framework.

A. COLLECTION
This stage creates the corpus of published tweets from the users' tweets. The Twitter Java API is first used to retrieve the tweets and store them in a JSON (JavaScript Object Notation) file, which is parsed to extract the hashtags of each tweet. The tweets are stored according to their publication time: we collect the timestamp of each published tweet and then sort the tweets by timestamp in ascending order. NLP (Natural Language Processing) [25] may be incorporated to refine the extraction results by removing URLs (Uniform Resource Locators) and special characters except the # character, unifying dates, normalizing letter case (upper or lower), and so on. Figure 2 illustrates the data collection stage, where the hashtags #BLOGGER and #blogger represent the same hashtag written in different styles; these are unified to the same hashtag #blogger. To summarize, the collection step involves two
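
The cleaning and ordering operations of the collection stage can be sketched as follows. This is only an illustrative sketch: the JSON field names `timestamp` and `text` are assumptions, not the fields of the Twitter API, and the regular expressions are a simplification of the NLP refinement mentioned in the text.

```python
import json
import re

HASHTAG = re.compile(r"#\w+")
URL = re.compile(r"https?://\S+")

def extract_hashtags(text):
    """Strip URLs, then return the lower-cased hashtags in order of first appearance."""
    text = URL.sub(" ", text)
    seen = []
    for tag in HASHTAG.findall(text):
        tag = tag.lower()          # #BLOGGER and #blogger become the same hashtag
        if tag not in seen:
            seen.append(tag)
    return seen

def build_corpus(json_lines):
    """Parse one JSON tweet per line and sort the result by timestamp (ascending)."""
    tweets = [json.loads(line) for line in json_lines if line.strip()]
    tweets.sort(key=lambda t: t["timestamp"])   # "timestamp" is a hypothetical field name
    return [(t["timestamp"], extract_hashtags(t["text"])) for t in tweets]

raw = [
    '{"timestamp": 20, "text": "New post! #BLOGGER #Travel http://example.com/a#frag"}',
    '{"timestamp": 10, "text": "Hello #blogger"}',
]
corpus = build_corpus(raw)
```

Removing URLs before matching matters here: otherwise the fragment `#frag` inside the link would be picked up as a hashtag.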

B. TRIVIAL TRANSFORMATION
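The purpose of this transformation process is to organize the set of tweets into a Boolean transaction database. Let $\Lambda = \{\lambda_1, \ldots, \lambda_m\}$ be the set of tweets. We define the tweets transaction database $D = \{D_1, \ldots, D_m\}$, with one transaction $D_i$ per tweet $\lambda_i$. Moreover, each item j in the transaction $D_i$ is set to 1 if the hashtag $H_j$ belongs to the tweet $\lambda_i$, and to 0 otherwise, and we have
$$D_{ij} = \begin{cases} 1 & \text{if } H_j \in \lambda_i \\ 0 & \text{otherwise} \end{cases}$$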
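
As a small illustration of the trivial transformation, the sketch below builds the Boolean matrix for three invented tweets. The function name and the choice of sorting hashtags alphabetically to fix column order are assumptions made for the example.

```python
def trivial_transformation(tweets):
    """Build a Boolean transaction database: one row per tweet, one column per hashtag."""
    hashtags = sorted({tag for tweet in tweets for tag in tweet})
    column = {tag: j for j, tag in enumerate(hashtags)}
    database = []
    for tweet in tweets:
        row = [0] * len(hashtags)
        for tag in tweet:
            row[column[tag]] = 1   # D[i][j] = 1 when hashtag j occurs in tweet i
        database.append(row)
    return hashtags, database

tweets = [["#italy", "#football"], ["#football"], ["#pizza", "#italy"]]
hashtags, D = trivial_transformation(tweets)
```

Each row of `D` is one transaction; the publication time of the tweet plays no role in this representation.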

C. TEMPORAL TRANSFORMATION
The aim of this step is to transform the set of tweets $\Lambda$, represented by the hashtags H, into a transactional database D that takes into account the temporal information of the published tweets. Let ts denote the time span covering all published tweets, size the length of each time window, and $W = \{W_1, W_2, \ldots, W_k\}$ the set of time windows, where $k = ts / size$. Each transaction $D_i$ groups the tweets of $\Lambda$ published in the time window $W_i$. Each item j in $D_i$ represents one hashtag of the tweets published in $W_i$. We also define two values: i) the external weight of each hashtag j, noted $\mu(j)$, set to the number of all tweets containing j, and ii) the internal weight of each hashtag j in the transaction $D_i$, noted $\rho(j, D_i)$, set to the number of tweets that contain j and were published in the time window $W_i$. The transformation between the set of tweets $\Lambda$ and the transactional database D is given as follows:
$$D_i = \{\, j \in H \mid \exists \lambda \in \Lambda \text{ published in } W_i \text{ such that } j \in \lambda \,\}$$
In other words, the tweets published in the time window $W_i$ are seen as one transaction in the transactional database D, and every hashtag corresponds to an item. If a hashtag j belongs to one of the tweets published in $W_i$, the associated item belongs to the transaction $D_i$, and the internal weight of this item is set to the number of tweets published in $W_i$ that contain the hashtag j.
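
The temporal transformation can be sketched in a few lines. This is a minimal illustration under stated assumptions: timestamps are plain numbers, windows are anchored at the earliest timestamp, and windows that contain no tweet yield no transaction. None of these choices is confirmed by the paper.

```python
from collections import Counter

def temporal_transformation(tweets, size):
    """Group time-stamped tweets into windows of length `size`.

    tweets: list of (timestamp, hashtags) pairs.
    Returns (transactions, mu): transactions[i] maps each hashtag j of window
    W_i to its internal weight rho(j, D_i); mu[j] is the external weight.
    """
    if not tweets:
        return [], Counter()
    start = min(ts for ts, _ in tweets)
    windows = {}
    mu = Counter()
    for ts, tags in tweets:
        index = int((ts - start) // size)      # window that holds this tweet
        rho = windows.setdefault(index, Counter())
        for tag in set(tags):                  # count each tweet once per hashtag
            rho[tag] += 1
            mu[tag] += 1
    # Empty windows produce no transaction in this sketch.
    transactions = [dict(windows[i]) for i in sorted(windows)]
    return transactions, mu

tweets = [(0, ["#a", "#b"]), (3, ["#a"]), (12, ["#b", "#c"]), (15, ["#c"])]
transactions, mu = temporal_transformation(tweets, size=10)
```

Here the four tweets collapse into two transactions, and the internal weights of a hashtag summed over all windows give its external weight.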

D. MINING PROCESS
This step aims to extract relevant patterns from a transactional database D. Several pattern mining algorithms [37]-[45] have been developed to do so efficiently. In this paper, we focus on some existing pattern mining algorithms widely used in the recent literature. A pattern p is considered relevant when Interestingness(D, I, p) ≥ γ, where Interestingness(D, I, p) is the measure used to evaluate a pattern p over the set of transactions D and the set of items I, and γ is the mining threshold [46].
From these two definitions, we present the existing pattern mining problems.
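Definition 5 (Boolean Database): We define a Boolean database by setting the function σ (see Def. 3) so that an entry is 1 if the item belongs to the transaction and 0 otherwise.

Definition 6 (Frequent Itemset Mining (FIM)): We define the FIM problem as an extension of the pattern mining problem (see Def. 4), where D is the Boolean database defined by Def. 5, the interestingness of a pattern is its support, i.e., the fraction of the transactions of D that contain it, and γ is the minimum support threshold.

Definition 7 (Closed Frequent Itemset Mining (CFIM)): We define the CFIM problem as an extension of FIM, where the result of FIM is pruned to closed frequent itemsets. Furthermore, an itemset X is closed if and only if there is no superset that has the same support as the given itemset.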

Definition 8 (Maximal Frequent Itemset Mining (MFIM)):
We define the MFIM problem as an extension of FIM, where the result of FIM is pruned to maximal frequent itemsets. Furthermore, an itemset X is maximal if and only if it is a frequent itemset for which none of its immediate supersets is frequent.
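
The relation between frequent, closed, and maximal itemsets can be checked on a toy database with the brute-force sketch below. It assumes relative support (the fraction of transactions containing the itemset) and enumerates every candidate, so it is exponential in the number of items and is not one of the algorithms evaluated in this paper.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Brute-force FIM: relative support of every itemset, kept if >= min_support."""
    items = sorted(set().union(*transactions))
    n = len(transactions)
    frequent = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            cset = frozenset(candidate)
            support = sum(cset <= t for t in transactions) / n
            if support >= min_support:
                frequent[cset] = support
    return frequent

def closed_itemsets(frequent):
    """Keep itemsets that have no proper superset with the same support."""
    return {p: s for p, s in frequent.items()
            if not any(p < q and s == frequent[q] for q in frequent)}

def maximal_itemsets(frequent):
    """Keep itemsets that have no frequent proper superset."""
    return {p: s for p, s in frequent.items()
            if not any(p < q for q in frequent)}

transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"a"}]
frequent = frequent_itemsets(transactions, min_support=0.5)
closed = closed_itemsets(frequent)
maximal = maximal_itemsets(frequent)
```

On this example five itemsets are frequent, three of them are closed ({a}, {a, b}, {a, c}), and two are maximal ({a, b}, {a, c}), which shows how each property shrinks the pattern set further.
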
Definition 9 (Utility Database): We define the utility database by setting the function σ (see Def. 3) to the internal utility $iu_{ij}$ of the item j in the transaction $D_i$. Note that $iu_{ij} = \rho(j, D_i)$; we also define the external utility of each item j by $eu(j) = \mu(j)$.

Definition 10 (High Utility Itemset Mining (HUIM)):
We define the HUIM problem as an extension of the pattern mining problem (see Def. 4) by
$$Interestingness(D, I, p) = U(p) = \sum_{D_i \in D,\; p \subseteq D_i} \; \sum_{j \in p} iu_{ij} \times eu(j),$$
where D is the utility database defined by Def. 9 and created using the temporal transformation of the published tweets, and γ is the minimum utility threshold.

Definition 11 (High Average Utility Itemset Mining (HAUIM)):
We define the HAUIM problem as an extension of HUIM, where the correlation between items is taken into account for determining utility as
$$AU(p) = \frac{U(p)}{|p|},$$
where $U(p)$ is the utility of the pattern p used in HUIM and |p| is the number of items of the pattern p. From these definitions, different scenarios may be observed (see Table 3 for more details).
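
To make the two utility measures concrete, the sketch below assumes the standard definitions from the high utility itemset mining literature: the utility of a pattern sums, over the transactions containing it, the product of internal and external utility of each of its items, and the average utility divides that sum by the pattern length. The toy data are invented; the external utilities were chosen to equal the per-hashtag totals of the internal weights.

```python
def utility(pattern, transactions, eu):
    """Utility of `pattern`: sum over transactions containing it of iu * eu per item."""
    total = 0
    for iu in transactions:                    # iu maps item -> internal utility
        if all(item in iu for item in pattern):
            total += sum(iu[item] * eu[item] for item in pattern)
    return total

def average_utility(pattern, transactions, eu):
    """Average utility: utility divided by the number of items |p|."""
    return utility(pattern, transactions, eu) / len(pattern)

# Three time windows with internal weights, and the matching external weights.
transactions = [{"#a": 2, "#b": 1}, {"#b": 1, "#c": 2}, {"#a": 1, "#b": 3}]
eu = {"#a": 3, "#b": 5, "#c": 2}

u_ab = utility(("#a", "#b"), transactions, eu)
au_ab = average_utility(("#a", "#b"), transactions, eu)
```

Dividing by the pattern length prevents long patterns from being favored merely because they accumulate utility from many items.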

E. SEARCHING PROCESS
This step aims at extracting the relevant hashtags with respect to the users' queries. Instead of scanning all published tweets, only the set of patterns obtained in the previous step, noted P, is used. The result of the searching process for a given query q is obtained by matching q against the patterns in P. In this step, a user query is represented by a set of hashtags. Moreover, any search process [47]-[49] can be applied in this step to compute the score between the set of discovered patterns and the hashtags of the user query.
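
One possible instantiation of this step is sketched below. It assumes Jaccard similarity as the scoring function, which is only one of many search processes that could be plugged in, and it returns the best-matching patterns; mapping those patterns back to the tweets that support them is left out of the sketch.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of hashtags."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def search(query, patterns, k=3, similarity=jaccard):
    """Rank the mined patterns against the query and return the k best matches."""
    scored = [(similarity(query, p), sorted(p)) for p in patterns]
    scored = [(s, p) for s, p in scored if s > 0]    # drop unrelated patterns
    scored.sort(key=lambda sp: (-sp[0], sp[1]))      # best score first, then alphabetical
    return scored[:k]

patterns = [
    {"#italy", "#football"},
    {"#italy", "#pizza", "#food"},
    {"#spain", "#football"},
]
results = search({"#italy", "#football"}, patterns, k=2)
```

The cost of a query depends on the number of patterns rather than on the number of tweets, which is the source of the runtime gains reported for PM-HR.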

V. EXPERIMENTAL EVALUATION
All implementations are executed on a computer with an i7 processor and 16 GB of RAM, running Windows 10. First, the pattern mining algorithms are compared by varying the mining threshold. The best algorithm with its best configuration is then compared with the baseline algorithms for the hashtag retrieval problem. The user queries, represented by sets of hashtags, are generated from the texts of the tweets using the chunk-based model proposed by Lee and Croft in [57]. First, the original text of each tweet is divided into a set of chunks. Each chunk is a noun phrase or a named entity describing one topic of a single published tweet. Afterwards, a learning-based strategy is used to extract the relevant chunks. Finally, each chunk is cleaned to construct the set of hashtags from it.
To evaluate the retrieved tweets, MAP (Mean Average Precision) and the F-measure are used. Both are widely used metrics for evaluating information retrieval systems, and are defined as follows:
1) F-measure. It combines the precision and recall measures as follows:
$$F\text{-}measure = \frac{2 \times Precision \times Recall}{Precision + Recall} \quad (5)$$
where $Recall = \frac{|RRT|}{|RT|}$ is the ratio of the number of retrieved relevant tweets (RRT) to the total number of relevant tweets (RT), and $Precision = \frac{|RRT|}{|RET|}$ is the ratio of the number of retrieved relevant tweets (RRT) to the total number of retrieved tweets (RET).

2) MAP. It is computed from the values Precision@i, where Precision@i is the precision at rank i, i.e., we consider the first i ranked tweets and ignore the remaining tweets.
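
Both metrics can be computed with the short sketch below. The F-measure follows the harmonic-mean definition of precision and recall; for MAP, the sketch assumes the common textbook form (average precision per query, taken at the ranks of relevant tweets and normalized by the number of relevant tweets, then averaged over queries), which may differ in detail from the exact formula used in the experiments.

```python
def f_measure(retrieved, relevant):
    """Harmonic mean of precision and recall for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    rrt = len(retrieved & relevant)            # retrieved relevant tweets
    if rrt == 0:
        return 0.0
    precision = rrt / len(retrieved)
    recall = rrt / len(relevant)
    return 2 * precision * recall / (precision + recall)

def average_precision(ranked, relevant):
    """Mean of Precision@i over the ranks i that hold a relevant tweet."""
    relevant = set(relevant)
    hits, total = 0, 0.0
    for i, tweet in enumerate(ranked, start=1):
        if tweet in relevant:
            hits += 1
            total += hits / i                  # Precision@i at a relevant rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """runs: list of (ranked_list, relevant_set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

ranked, relevant = [1, 2, 3, 4], {1, 3, 5}
f = f_measure(ranked, relevant)
ap = average_precision(ranked, relevant)
```

For this example, precision is 1/2 and recall is 2/3, so the F-measure is 4/7, while the average precision is (1 + 2/3) / 3 = 5/9.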

A. PATTERN MINING ALGORITHMS PERFORMANCE
This first experiment aims to select the best pattern mining algorithm for each task (FIM, ClosedFIM, MAXFIM, HUIM, HAUIM). Several algorithms have been tested and compared on the tweet collections in terms of runtime and memory consumption:
1) ALLFIM-HR: Three algorithms have been compared: Apriori [41], PrePost+ [51], and SSFIM [39]. The results, reported in Figure 3.(a) and Figure 4.(a), favor PrePost+ over Apriori and SSFIM. SSFIM is better suited to non-dense databases, whereas all the tweet collections are dense, and Apriori needs more scans to find all the frequent patterns from the given tweets. Note that the tweet collection is reduced after the preprocessing step, and the set of transactions becomes non-sparse after the transformation step. Moreover, PrePost+ is based on the FP-growth algorithm [58], which needs only two database scans, and it also benefits from efficient tree structures to store and manipulate the candidate patterns. For this reason, PrePost+ is selected as the frequent itemset mining algorithm in the PM-HR framework.
2) ClosedFIM-HR: Three algorithms have been compared: FPClose [40], Charm [52], and LCM [53]. The results, reported in Figure 3.(b) and Figure 4.(b), reveal that FPClose outperforms the two other algorithms (Charm and LCM) in five cases out of eight, in terms of runtime and memory usage. This is explained by the fact that FPClose performs well on sparse data by employing an efficient FP-array technique that greatly reduces the need to traverse the tree structure of candidate patterns. For this reason, FPClose is selected as the closed itemset mining algorithm in PM-HR.
3) MAXFIM-HR: Two algorithms have been tested: FPMax [37] and LFI-Miner [54]. The results, reported in Figure 3.(c) and Figure 4.(c), reveal that LFI-Miner outperforms FPMax in terms of runtime and memory consumption in all cases. The reason is that LFI-Miner builds on the FP-growth algorithm [58] and applies efficient pruning strategies to reduce the search space, such as trimming insufficiently frequent hashtags that cannot contribute to generating longer frequent patterns. For this reason, LFI-Miner is selected as the maximal itemset mining algorithm in the PM-HR framework.
4) HUIM-HR: Three algorithms have been tested: EFIM [59], HUP-Miner [60], and UP-Growth+ [61]. The results, reported in Figure 3.(d) and Figure 4.(d), reveal that EFIM outperforms the two other algorithms (HUP-Miner and UP-Growth+) in terms of runtime and memory consumption in six cases out of eight, whereas UP-Growth+ gives the best results in the remaining two cases. These results can be explained by the fact that EFIM introduces several optimizations that improve the mining process: it develops two new upper bounds to greatly prune the search space, and it introduces database projection and merging approaches to reduce the data size. For this reason, EFIM is selected as the high utility itemset mining algorithm in the PM-HR framework.
5) HAUIM-HR: Two algorithms have been compared: TPAU [55] and TKAU [56]. The results, reported in Figure 3.(e) and Figure 4.(e), reveal that TKAU outperforms TPAU in terms of runtime and memory consumption in all cases. The reason is the efficient list structure employed in TKAU, which reduces the cost of the join operations when computing the utilities of the candidate patterns. For this reason, TKAU is selected as the high average utility itemset mining algorithm in the PM-HR framework.

Table 5 shows the number of relevant patterns discovered by the different pattern mining algorithms on the tweet collections, with the minimum support, the minimum utility, and the minimum average utility each set to 0.10. According to this table, the results reveal the following two observations:
i) From classical FIM to closed FIM and maximal FIM, the number of relevant patterns discovered is reduced by roughly a factor of two. Thus, the number of relevant patterns discovered by ALLFIM-HR is 22,514, whereas it is 12,547 for ClosedFIM-HR and 8,883 for MAXFIM-HR, when mining the largest collection, Football, which contains 3,000,000 tweets and 90,660 hashtags.
ii) From high utility itemset mining to high average utility itemset mining, the number of relevant patterns discovered is greatly reduced, by roughly a factor of five. Thus, the number of relevant patterns discovered by HUIM-HR is 9,976, whereas it is only 2,105 for HAUIM-HR, on the Football collection.
These results are obtained thanks to the pruning strategies used (closure, maximal, and high average), which reduce the number of relevant patterns discovered. We can also say that the high average property is stronger than the first two properties (closure and maximal). We can explain this by the fact that the high average property is applied to the items (hashtags) of the given pattern, whereas the closure and maximal properties are only applied between patterns. The three properties (closure, maximal, and high average) reduce the pattern space, but several questions remain: 1) Does this reduction affect the final results of the hashtag retrieval process? 2) Which strategy is the best? 3) Is over-reduction possible and, if so, which strategy produces it? All these questions are answered in the next experiment.
Table 6 shows the accuracy of the pattern mining algorithms when varying the number of relevant patterns discovered from 100 to 1,000. These results reveal that the accuracy of the pattern mining algorithms increases with the number of relevant patterns. For instance, with 100 relevant patterns, the accuracy of the pattern mining algorithms does not exceed 74%, whereas with 1,000 relevant patterns it reaches 89%. These results confirm the importance of studying the correlations among the set of hashtags to improve the accuracy of the hashtag retrieval process. Table 7 shows the accuracy of the pattern mining algorithms when varying the size of the time windows from 10 to 100. These results reveal that increasing the time window size up to 50 increases the accuracy of the pattern mining algorithms. However, with time windows larger than 50, the accuracy decreases. These results show the importance of choosing suitable time windows in the temporal transformation to improve the accuracy of the hashtag retrieval process. In addition, we can explain these results by the fact that, by increasing the time windows up to 50, more hashtags are merged into a single transaction, which increases the number of relevant patterns discovered. However, with time windows larger than 50, only a few transactions are created, which decreases the number of relevant patterns discovered, in particular when the mining threshold is not well tuned. Figure 5 shows the quality of the hashtags returned by the different pattern mining algorithms on the tweet collections when varying the minimum support, the minimum utility, and the minimum average utility from 0.40 down to 0.01. According to this figure, all algorithms improve their quality as the mining threshold decreases down to a certain value (0.05 for Sewol ferry, Wikepedia1, Wikepedia2, and Wikepedia3; 0.02 for Nelson Mandela, Football, TREC2011, and TREC2015), below which the quality decreases. Moreover, HAUIM-HR outperforms the other algorithms in terms of F-measure, whatever the tweet collection and the minimum threshold used in the experiments. These results can be explained by the fact that HAUIM-HR benefits from the temporal information obtained in the transformation step, and also from the high average property, which efficiently reduces the pattern space. The results also reveal that MAXFIM-HR gives the worst results compared to the other pattern mining algorithms in every case tested. This is due to the over-reduction caused by MAXFIM-HR on hashtag data, where only maximal patterns are derived. We can conclude from these experiments that the temporal transformation is more valuable than the trivial transformation. Using the temporal transformation, the pattern mining algorithms derive co-occurrence patterns that span different timestamps, whereas the trivial transformation does not consider the publication time of the tweets, which degrades the overall performance of the pattern mining algorithms. Accordingly, HAUIM-HR is chosen for the remaining experiments, using the best minimum utility threshold (0.05 for Sewol ferry, Wikepedia1, Wikepedia2, and Wikepedia3; 0.02 for Nelson Mandela, Football, TREC2011, and TREC2015).

B. PM-HR VS STATE-OF-THE-ART HASHTAG RETRIEVAL ALGORITHMS
The last experiment compares the PM-HR framework with recent state-of-the-art HR approaches on the tweet collections, in terms of runtime and solution quality. Figure 6 presents the runtime of PM-HR and of the baseline approaches Hashtagger+ [62], ATR-Vis [63], SAX* [64], and SCSHG [65] with different tweet collections and different numbers of queries. When varying the number of queries from 1,000 to 10 million, the results reveal that PM-HR clearly outperforms the baseline approaches, in particular for large collections. Thus, for the Football tweet collection, which contains 3,000,000 tweets and 90,660 hashtags, PM-HR needs only 6,000 seconds to process 10 million queries, whereas the other approaches need more than 22,000 seconds to process the same number of queries. Moreover, the runtime of PM-HR stabilizes as the number of queries increases, whereas the runtime of the other approaches grows sharply. These results are obtained thanks to the knowledge base designed by PM-HR, which represents the relevant patterns of the tweet collections. Instead of exploring the whole collection as in the baseline approaches, only this knowledge base is explored.
FIGURE 5. F-measure of the pattern mining algorithms using different tweet collections and with different mining threshold.
In terms of solution quality, F-measure (Eq. 5) and MAP (Eq. 6) have been used. Table 8 compares the quality of the tweets retrieved by PM-HR and by the baseline approaches Hashtagger+, ATR-Vis, SAX*, and SCSHG. The results reveal that for medium tweet collections such as Sewol ferry, Wikepedia2, and Wikepedia3, the baseline approaches outperform PM-HR. For large tweet collections such as football, TREC2011, and Nelson Mandela, however, PM-HR outperforms the four other approaches. These results again show the benefits of using pattern mining techniques to explore tweet collections. Moreover, they confirm the usefulness of i) high average-utility itemset mining for discovering relevant patterns, and ii) introducing temporal information when exploring the tweet collections.
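For readers who want to reproduce the two quality measures, the following minimal sketch uses the standard information retrieval definitions of F-measure and mean average precision. It is assumed, not confirmed by this section, that Eq. 5 and Eq. 6 follow these standard forms; the function names and the example data are hypothetical.

```python
def f_measure(retrieved, relevant):
    """Harmonic mean of precision and recall (standard F1 definition)."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    if hits == 0:
        return 0.0
    precision = hits / len(retrieved)
    recall = hits / len(relevant)
    return 2 * precision * recall / (precision + recall)


def average_precision(ranked, relevant):
    """Average of the precision values at each rank holding a relevant item."""
    relevant = set(relevant)
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """Mean of average_precision over (ranked list, relevant set) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Hypothetical query result: tweets 2 and 4 are relevant hits, tweet 5 is missed.
score = f_measure([1, 2, 3, 4], [2, 4, 5])
ap = average_precision([1, 2, 3, 4], [2, 4, 5])
```

F-measure ignores the order of the retrieved tweets, whereas MAP rewards rankings that place relevant tweets early, which is why the two measures are usually reported together.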

C. DISCUSSION
This section discusses the main findings from the application of the proposed framework to real challenging tweet collections.
• The first finding of this study is that the proposed framework can deal with a large number of tweets, hashtags, and queries in real time. This differs from previous hashtag retrieval approaches, which have long execution times due to the high dimensionality of the set of hashtags. The proposed framework has both an inductive and a predictive character: i) it induces the knowledge-based system by applying pattern mining algorithms to identify the most representative patterns of the given tweet collection, and ii) it predicts the relevant tweets for a user query without considering the whole tweet collection. In the context of hashtag retrieval, we argue that considering the temporal information and the high average-utility patterns in the preprocessing step allows the relevant tweets to be derived quickly.
• From a data mining research standpoint, PM-HR is an example of the application of a generic pattern mining algorithm to a specific context. The literature calls for this type of research, particularly in the era of social media analysis, where a large number of tweets becomes available every day. As in many other cases, porting a pure data mining technique to a specific application domain requires methodological refinement and adaptation [6], [9], [11], [12]. In our specific context, this adaptation is implemented in different phases: the transformation, the mining process, and the searching process. To the best of our knowledge, the approach proposed in this paper is the first to investigate pattern mining with temporal information for exploring large tweet collections.

VI. CONCLUSION
This paper integrates high average-utility itemset mining to solve the hashtag retrieval problem. The proposed approach, HAUIM-HR, benefits from the high average-utility itemsets to improve the searching process for finding the most relevant hashtags for a given query. A pre-processing step is first performed to transform the corpus into a transactional database, considering the temporal information of the published tweets. The mining process is then applied to the transactional database generated in the previous step to discover the high average-utility itemsets. The searching step benefits from the high average-utility itemsets and the knowledge-based system to find the most relevant hashtags for a given query. From our study, we can conclude that the main advantage of pattern mining over the baseline approaches for solving the hashtag retrieval problem is that it studies the different correlations among the set of hashtags in a given tweet collection. Moreover, extensive experiments carried out on large and varied corpora of published tweets show that our solution benefits from the knowledge extracted (i.e., the high average-utility itemsets), outperforms the baseline methods in terms of runtime, and is very competitive in terms of accuracy. The proposed framework fails when no relevant hashtags appear in the query; traditional solutions are more suitable in that case. However, this case rarely occurs in real scenarios, where queries usually contain relevant hashtags. In future work, we plan to discover different kinds of knowledge, such as maximal high average-utility itemsets and closed high average-utility itemsets, to improve both the search performance and the accuracy. We will also consider the spatial dimension when transforming the tweet corpus into the transactional database. Moreover, it is necessary to design a parallel approach that relies on high-performance computing tools such as MapReduce or Spark to deal with big corpora of published tweets.

Definition 3 (Pattern): Let us consider $I = \{1, 2, \ldots, n\}$ as a set of items, and $D = \{D_1, D_2, \ldots, D_m\}$, where $n$ is the number of items and $m$ is the number of transactions. We define the function $\sigma$, where for the item $i$ in the transaction $D_j$, the corresponding pattern reads $p = \sigma(i, j)$.

Definition 4 (Pattern Mining): A pattern mining problem finds the set of all relevant patterns $L$, such as $L = \{p \mid \mathrm{Interestingness}(D, I, p) \geq \gamma\}$.

… 3) as
$$\sigma(i, j) = \begin{cases} 1 & \text{if } i \in D_j \\ 0 & \text{otherwise} \end{cases}$$

Definition 6 (Frequent Itemset Mining (FIM)): We define the FIM problem as an extension of the pattern mining problem (see Def. 4) by $L = \{p \mid \mathrm{Support}(D, I, p) \geq \gamma\}$, with
$$\mathrm{Support}(D, I, p) = \frac{|p|_{D,I}}{|D|} \qquad (1)$$
where $D$ is the boolean database defined by Def. 5 and created using the trivial transformation of the published tweets, $\gamma$ is a minimum support threshold, and $|p|_{D,I}$ is the number of transactions in $D$ containing the pattern $p$.

Definition 7 (Closed Frequent Itemset Mining (CFIM)):
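Returning to Def. 6 and Eq. (1), the support measure and the resulting set L can be illustrated with a minimal brute-force sketch. It is not one of the mining algorithms evaluated in the paper (such as Apriori, SSFIM, or PrePost+); the toy transactions and the function names `support` and `frequent_itemsets` are hypothetical.

```python
from itertools import combinations


def support(pattern, transactions):
    """Eq. (1): fraction of transactions that contain every item of the pattern."""
    pattern = set(pattern)
    return sum(pattern <= t for t in transactions) / len(transactions)


def frequent_itemsets(transactions, gamma):
    """Return {itemset: support} for all itemsets with support >= gamma.

    Brute-force enumeration by increasing size; it stops at the first size
    with no frequent itemset, since no larger itemset can then be frequent.
    """
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        found = False
        for candidate in combinations(items, size):
            s = support(candidate, transactions)
            if s >= gamma:
                result[frozenset(candidate)] = s
                found = True
        if not found:
            break
    return result


# Hypothetical boolean database: each transaction is the hashtag set of a tweet.
transactions = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b"}]
frequent = frequent_itemsets(transactions, gamma=0.5)
```

With these four transactions and a threshold of 0.5, the pattern {a, b} is frequent because it appears in two of the four transactions, while {b, c} appears in only one and is discarded.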

FIGURE 3. Runtime of the pattern mining algorithms using different tweet collections.
(a) and Figure 4.(a), show that PrePost+ outperforms Apriori and SSFIM in terms of runtime and memory usage for all cases.These results are explained by the fact that SSFIM works on

FIGURE 4. Memory usage (MB) of the pattern mining algorithms using different tweet collections.

FIGURE 6. Runtime (s) of the PM-HR and state-of-the-art hashtag information algorithms using different number of queries.

TABLE 8. F-measure and MAP for the PM-HR and state-of-the-art hashtag information algorithms, and with different tweet collections.

TABLE 1. Classification of information retrieval approaches and their limitations.

TABLE 2. Example of corpus of published tweets.

TABLE 3. Transaction database D.

TABLE 5. Number of relevant patterns discovered on tweets collections for the pattern mining algorithms, by setting minimum support, minimum utility and average minimum utility values to 0.10, respectively.

TABLE 6. F-measure of the pattern mining algorithms by varying the number of relevant patterns discovered on Wikepedia3.

TABLE 7. F-measure of the pattern mining algorithms by varying the size of time windows in temporal transformation using Wikepedia3.