LMRank: Utilizing Pre-Trained Language Models and Dependency Parsing for Keyphrase Extraction

Keyphrase extraction is a Natural Language Processing task pertaining to the automatic extraction of salient terms that semantically encapsulate the major theme and topics of a document. In this article, we present LMRank, a novel approach that utilizes dependency parsing and the sentence embeddings of pre-trained language models to improve the accuracy of the keyphrase extraction task. In addition, we conduct a benchmark analysis of our approach, which showcases that it scales far better than similar ones in terms of execution time. The contribution of this work is threefold: (i) we propose a novel approach that significantly outperforms the state-of-the-art keyphrase extraction approaches in terms of time performance and accuracy in selected datasets; (ii) we provide a comparative evaluation of our approach against previous ones, by utilizing broadly used datasets in the literature and established evaluation metrics (e.g., the F1 and pF1 scores); (iii) we make the datasets and code used in our experiments public, aiming to further increase the reproducibility of this work and facilitate future research in the field.


I. INTRODUCTION
Keyphrase extraction (KE) is a basic Natural Language Processing (NLP) task; it has been described as the process of automatically extracting keyphrases from a document, i.e., a set of phrases containing one or multiple words that are considered to be meaningful and representative for a document [1]. Various Information Retrieval (IR) and NLP tasks, such as text summarization, categorization, classification, and generation of recommendations based on textual data, greatly benefit from the utilization of KE approaches [2].
Research on KE is attracting increasing interest, as documented by recent approaches including: (i) HyperMatch [3], which is a deep learning approach that embeds keyphrases and the document in the same hyperbolic space and estimates the relevance between them by utilizing the Poincaré distance; (ii) the approach described in [4], which suggests the use of the intermediate layers of BERT [5] that are often ignored by previous KE approaches; (iii) HAKE [6], which simultaneously utilizes multiple textual features (linguistic, statistical, structural and semantic) to extract the most appropriate keyphrases; (iv) PatternRank [7], which relies on pre-trained language models and POS (Part-of-Speech) tags to extract the most relevant keyphrases. It is noted that, while certainly interesting, these approaches have not made their code publicly available.
The associate editor coordinating the review of this manuscript and approving it for publication was Ali Shariq Imran.
Overall, KE approaches proposed in the literature can be separated into two major groups, namely unsupervised and supervised ones, both possessing certain advantages and drawbacks. Specifically, one major disadvantage of supervised approaches is that they require a large amount of training data to be annotated by human experts, in contrast to unsupervised ones [1]. Another drawback of supervised approaches is that they often exhibit bias towards the domain of their training, thus they cannot generalize to new domains [1]. A major advantage of the unsupervised approaches is that they are applicable in various settings, since they do not require any training or domain-specific knowledge [8]. However, in datasets whose texts are thoroughly labelled by humans, the supervised approaches have been shown to achieve better evaluation scores than their unsupervised counterparts, as documented in [8] and [9]. The authors of [8] also highlight the ability of unsupervised approaches to run in real time due to their computational efficiency. For all the above reasons, this paper focuses on unsupervised approaches.
These are often evaluated using exact or partial match F1 metrics. For instance, in the review presented in [9], the authors comparatively assess multiple KE approaches by using such metrics, arguing that the partial match evaluation framework yields F1 scores that are closer to those assessed manually by human experts (compared to the scores calculated by the exact match evaluation framework).
Unsupervised KE approaches have been further classified as classical or embeddings-based ones [9]. Compared to the embeddings-based approaches, the classical ones miss important semantic information from the text, thus resulting in lower F1 scores. However, most embeddings-based KE approaches today do not build on recent advancements in deep learning models that demonstrate increased accuracy in many NLP tasks. In addition, while many embeddings-based KE approaches utilize regular expression patterns that extract phrases consisting of nouns and adjectives, these patterns often ignore common language patterns (e.g., conjunctions); this results in the production of long phrases that consist of multiple keyphrases. As proposed in [10], dependency parsing, which refers to the process of examining the dependencies between the phrases of a sentence to determine its grammatical structure, provides a way to produce meaningful and cohesive candidate keyphrases.
Aiming to advance the state-of-the-art of embeddings-based keyphrase extraction, this article proposes a novel unsupervised KE approach, called LanguageModelRank (LMRank), which builds on the strengths of the above-mentioned models and techniques. Specifically, our approach builds on dependency parsing for the candidate keyphrase extraction step and leverages the accuracy of sentence embeddings from pre-trained language models, aiming to augment the quality of keyphrase ranking. Our approach is unsupervised, in that it does not require labelled data; thus, it can be used on texts of different themes and topics without additional training on documents whose keyphrases have been manually assigned by human experts. The accuracy and performance of the proposed approach are thoroughly assessed against a selected set of prominent unsupervised KE approaches existing in the literature.
The overall contribution of the work described in this article is threefold: (i) we propose a novel KE approach, which reaches or surpasses the state-of-the-art regarding the KE task; (ii) we provide a comparative evaluation of our approach against other similar ones, by utilizing widely known datasets of the literature and established evaluation metrics (e.g., the F1 and pF1 scores); (iii) we make the datasets used, the code of the proposed approach, as well as the code developed for our KE experiments publicly available, aiming to further increase the reproducibility of this work and facilitate future research in the field.
The remainder of this article is organized as follows. Background concepts and related work concerning selected classical and embeddings-based unsupervised KE approaches are analyzed in Section II. The proposed approach is presented in Section III. The evaluation of these approaches, together with the associated technical specifications, datasets and metrics, is presented in detail in Section IV. Finally, concluding remarks and future work directions are outlined in Section V.

II. RELATED WORK
As suggested in [9], the unsupervised KE approaches can be divided into two major subcategories, namely the classical and the embeddings-based ones. These can be further split into more subcategories according to the underlying technique employed each time (thus distinguishing them as statistical approaches, graph-based approaches, etc.). Adopting the above classification, this section reports on prominent KE approaches that have been used as the baseline in our experimentations.

A. CLASSICAL APPROACHES
TextRank [11] is a graph-based approach, which builds a word graph. As a first step, it assigns POS tags for each term in the text; then, candidate keyphrases that consist of nouns and adjectives are selected. Each candidate keyphrase is added to the graph as a node. Edges connect terms, which are included in a sliding window of N terms. For the case of undirected and unweighted edges, the TextRank score $S(v_i)$ for each node $v_i$ is calculated by the following recursive formula:

$$S(v_i) = (1 - d) + d \sum_{v_j \in N(v_i)} \frac{S(v_j)}{|N(v_j)|} \tag{1}$$

where $d$ is the damping factor, set to 0.85 as proposed in [1], and $N(v_j)$ denotes the set of neighboring nodes of $v_j$. When (1) converges, the keyphrases (nodes) are sorted in descending order by their calculated scores.
KP-Miner [12] is a statistical approach that utilizes the classic information retrieval TF-IDF metric score alongside two statistical features, which affect the selection of candidate keyphrases. The first feature is the cutoff constant, which ensures that keyphrases which had their first occurrence after this constant will be filtered out of the list of candidate keyphrases. The second feature is the least allowable seen factor (k), which filters out keyphrases that appeared less than k times. As a final step, this approach ranks the filtered list of candidate keyphrases and returns the top-n specified. This ranking is determined by the combination of the TF-IDF keyphrase scores, their positions in the text, and a boosting factor that favors keyphrases consisting of multiple terms instead of a single one.
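As an illustration of the TextRank recursion in (1) for undirected and unweighted edges, the following minimal Python sketch iterates the scores until they stop changing. The function name `textrank_scores`, the convergence tolerance, the iteration cap and the toy edge list are assumptions made for this example only; they are not taken from [11].

```python
def textrank_scores(edges, d=0.85, tol=1e-6, max_iter=100):
    """Iterate S(v_i) = (1 - d) + d * sum over neighbors v_j of S(v_j) / |N(v_j)|."""
    # Build an undirected, unweighted adjacency map from the edge list.
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    scores = {node: 1.0 for node in neighbors}
    for _ in range(max_iter):
        updated = {
            node: (1 - d) + d * sum(scores[n] / len(neighbors[n]) for n in neighbors[node])
            for node in neighbors
        }
        # Stop once no score moved by more than the tolerance.
        converged = max(abs(updated[n] - scores[n]) for n in neighbors) < tol
        scores = updated
        if converged:
            break
    # Nodes sorted in descending order of their scores.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy co-occurrence edges (hypothetical, for illustration only).
edges = [("float", "switch"), ("switch", "valve"), ("relief", "valve"), ("pressure", "valve")]
ranking = textrank_scores(edges)
```

In this toy graph, 'valve' has the most neighbors and therefore ends up first in `ranking`.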
MultiPartiteRank [13], abbreviated as MPRank, is a graph-based approach that relies on topic modelling. The first step of this approach is the construction of a multipartite graph, which models keyphrase candidates as nodes in the graph, while the edges connect keyphrases belonging to different topics. These edges are weighted based on the distance between each pair of candidate keyphrases $(c_i, c_j)$. Their weights are adjusted according to the following equation:

$$e_w(c_i, c_j) = e_w(c_i, c_j) + \alpha \cdot \exp\!\left(\frac{1}{p_i}\right) \cdot \sum_{c_k \in T(c_j) \setminus \{c_j\}} e_w(c_k, c_i) \tag{2}$$

where $e_w(c_i, c_j)$ is the weight between each pair of candidate keyphrases, $\alpha$ is a hyperparameter that controls the weight adjustment, $p_i$ is the relative position in text of $c_i$ and $T(c_j) \setminus \{c_j\}$ is the set of candidate keyphrases that belong to the same topic as $c_j$, without including it. After the graph is constructed, the extracted keyphrases are ranked using the TextRank approach. As a final step, the top-N ranked keyphrases are extracted.
YAKE! [14] is a statistical approach that utilizes various statistical metrics. It first splits the text into individual terms and then calculates a score $S(t)$ for each term $t$. This score relies on the following metrics: $T_{case}$ (casing aspect of $t$; this metric considers that uppercase terms, or terms that start with a capital letter and are not found near the start of a sentence, have higher importance than other ones); $T_{pos}$ (favors terms that are positioned near the beginning of the document); $TF_{norm}$ (term frequency normalization); $T_{rel}$ (term relatedness to context; this metric measures the number of distinct terms that are found on the left and right side of $t$); $T_{difsent}$ (favors terms that appear more frequently across different sentences). For each term $t$, the score $S(t)$ is calculated as follows:

$$S(t) = \frac{T_{rel} \cdot T_{pos}}{T_{case} + \frac{TF_{norm}}{T_{rel}} + \frac{T_{difsent}}{T_{rel}}} \tag{3}$$

As shown in (4), for each candidate keyphrase $ck$, a score $S(ck)$ is calculated, which relies on the $S(t)$ scores of its constituent terms:

$$S(ck) = \frac{\prod_{t \in ck} S(t)}{TF(ck) \cdot \left(1 + \sum_{t \in ck} S(t)\right)} \tag{4}$$

where $TF(ck)$ is the frequency of $ck$ in the document. It is noted that smaller values of $S(ck)$ indicate a higher quality of $ck$.

B. PRE-TRAINED EMBEDDING MODELS
One major drawback of the classical approaches is that they do not encapsulate the semantic information of both the candidate keyphrases and the document. This information can be incorporated through pretrained word, phrase or sentence embeddings, which were introduced through the Word2Vec model [15], as an attempt to improve the accuracy of existing NLP approaches. A series of similar models have since been introduced in the literature, including Doc2Vec [16], GloVe [17], FastText [18], SIF [19], Sent2Vec [20] and ELMo [21]. Generally speaking, earlier models calculate embeddings at the term level (word embeddings), while later ones model their embeddings at the phrase or sentence level (phrase / sentence embeddings). The advantage of the latter is that they capture additional semantics. After the introduction of the Transformer model [22], several pre-trained deep learning language models have also been proposed in the literature, including BERT [5], MUSE [23], and DistilBERT [24]. These models demonstrate increased accuracy in various NLP tasks over their predecessors.

C. EMBEDDINGS-BASED APPROACHES
The approaches described in this section utilize the semantic information provided by the embedding vector representations mentioned in the previous section. This semantic information facilitates the computation of the semantic similarity between the candidate keyphrase and the document itself, where more similar keyphrases that capture the semantic context of the document are ranked higher.
WordAttractionRank [25] is a graph-based approach, which utilizes pretrained word embeddings. Initially, this approach preprocesses the input document (e.g., performs tokenization, applies POS tags, and extracts adjectives and nouns), similarly to other graph-based approaches. Then, it constructs a graph of terms, where terms are represented as nodes connected with edges that model their co-occurrence in a window of N terms. The edges are weighted using the word attraction score, which is calculated as the product of the Dice coefficient of term frequencies and the force attraction score for each pair of terms. The force attraction score is calculated as the product of term frequencies for a pair of terms, divided by the Euclidean distance of their word embeddings. After this score is calculated for each term, co-occurring terms (in a window of N terms) are concatenated into keyphrases. Keyphrases that do not end with nouns are filtered out. Finally, the top-n keyphrases with the highest word attraction score are extracted.
Reference Vector Algorithm (RVA) [26] is a statistical approach. Initially, RVA trains GloVe embeddings from the terms of a document. In a second step, the mean word embedding of terms found in the title and abstract of a document is calculated. In a third step, this approach extracts n-grams, where 1 ≤ n ≤ 3, from the title and abstract. These n-grams are considered the candidate keyphrases. Finally, the candidate keyphrases are ranked in descending order of the cosine similarity between the document embedding and the mean embedding of each candidate keyphrase, and the top-n most similar keyphrases are extracted.
EmbedRank / EmbedRank++ [27] is a statistical approach, which utilizes sentence level embeddings. The EmbedRank++ version adds an extra processing step over EmbedRank that keeps diverse keyphrases for the final list. The major difference of this approach over previous ones is that it utilizes sentence embeddings, which capture more semantic information compared to earlier embedding models. EmbedRank utilizes a three-step process. As a first step, the candidate phrases are extracted from the document if they solely comprise nouns or adjectives or both. As a second step, the document embedding is computed as the average of all sentence embeddings; the candidate keyphrase embeddings are similarly calculated. Thirdly, the keyphrases are ranked based on the cosine distance of their embedding vector from the document embedding vector. The extra diversification step of EmbedRank++ is to measure a Maximal Marginal Relevance (MMR) metric that is modified for the KE task. This metric re-ranks the final keyphrase list in a way that diverse keyphrases, which are not semantically similar, are ranked higher before extracting the top-N ones. This metric is computed using the following equation:

$$\mathrm{MMR} := \underset{C_i \in C \setminus K}{\arg\max} \left[ \lambda \cdot \mathrm{cos}_{sim}(C_i, doc) - (1 - \lambda) \max_{C_j \in K} \mathrm{cos}_{sim}(C_i, C_j) \right] \tag{5}$$

where $C_i$, $C_j$ are the embedding vectors of candidate keyphrases $i$, $j$, $doc$ is the document embedding vector, $C$ is the set of all candidate keyphrases, $K$ is the set of all extracted keyphrases and $\lambda$ is a hyperparameter that controls the amount of diversity of the final list of candidate keyphrases. Finally, $\mathrm{cos}_{sim}$ is the normalized cosine similarity function applied between two vectors.
Key2Vec [10] is a graph-based approach, which utilizes phrase embeddings. Initially, it generates phrase embeddings from a FastText embeddings model that is trained on a large scientific corpus. This model is trained on candidate keyphrases, which are extracted n-grams, comprising solely nouns and adjectives. Similarly to earlier approaches, the candidate keyphrases are scored by calculating the cosine similarity between the candidate keyphrase embeddings and the document embeddings. However, the candidate keyphrases are ranked by utilizing a weighted graph of terms with edges connecting terms co-occurring in a fixed window size. For each pair of terms, the weights of the graph are calculated by dividing their cosine similarity by the pointwise mutual information metric of their term frequency co-occurrence. Finally, after the weight for each edge of the graph is calculated, the keyphrases are ranked based on their descending weighted Personalized PageRank [28] score and the top-n keyphrases are extracted.
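The MMR-based diversification step used by EmbedRank++ can be illustrated with the following minimal Python sketch. It is a greedy selection under stated assumptions: plain (not normalized) cosine similarity is used for brevity, and the names `cosine`, `mmr_select`, as well as the two-dimensional toy vectors, are hypothetical and chosen only for this example.

```python
import math


def cosine(a, b):
    """Plain cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(doc, candidates, top_n, lam=0.5):
    """Greedily pick top_n phrases from {phrase: vector}, trading relevance for diversity."""
    selected = []
    remaining = dict(candidates)
    while remaining and len(selected) < top_n:
        def mmr_score(phrase):
            relevance = cosine(remaining[phrase], doc)
            # Similarity to the closest already-selected keyphrase (0 if none yet).
            redundancy = max(
                (cosine(remaining[phrase], candidates[s]) for s in selected),
                default=0.0,
            )
            return lam * relevance - (1 - lam) * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        del remaining[best]
    return selected


# Toy embeddings (hypothetical values, for illustration only).
doc = [1.0, 1.0]
candidates = {
    "machine learning": [1.0, 0.9],
    "machine learning algorithms": [1.0, 0.8],
    "keyphrase extraction": [0.2, 1.0],
}
picked = mmr_select(doc, candidates, top_n=2)
```

With λ = 0.5, the second pick is 'keyphrase extraction' rather than the near-duplicate 'machine learning algorithms', although the latter is closer to the document vector.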
SIFRank / SIFRank+ [29] is a statistical approach, which utilizes the SIF sentence embedding model and the ELMo pre-trained language model to produce embeddings. Both versions of this approach share the same underlying methodology; nonetheless, SIFRank+ performs better on longer documents, while SIFRank performs better on shorter ones. The approach comprises four steps. In the first step, the candidate keyphrases are selected similarly to other approaches. In the second step, the word embedding vector for each term of all candidate keyphrases is generated from ELMo. In the third step, the previously calculated word embeddings are aggregated by SIF (see Section II-B) to produce a sentence embedding for each candidate keyphrase; in this step, the mean embedding is also calculated for the document itself. In the final step, keyphrases are ranked by the cosine similarity between their embeddings and the document embedding.
For each candidate keyphrase, SIFRank+ performs an additional step by multiplying the former similarity score by a position-based SoftMax function (see Eq. 6 and Eq. 7). Here, $p_{ck_{i1}}$ is the first positional occurrence of a selected candidate keyphrase, while $\mu$ is a hyperparameter that tunes the position-biased weight of the candidate keyphrases found at the beginning of the document, especially the first phrase.
KeyBERT [30] is a statistical approach, which relies on the pre-trained sentence transformer approach [31]. As a first step, the model extracts the candidate keyphrases as a list of n-grams, based on their occurrence frequency. Then, it generates a sentence embedding for each candidate keyphrase and a document embedding, similarly to earlier approaches, using a user-selected pre-trained model. It is noted here that many pre-trained sentence transformer models are already available. In this work, we utilize the all-mpnet-base-v2 model, which is based on the MPNet approach [32]. Afterwards, the keyphrases are ranked based on the descending cosine similarity score between their embeddings and the document embedding.
Similarly to EmbedRank, KeyBERT includes an extra diversification step for the ranked list of keyphrases. This step computes either the MMR metric (as shown in Eq. 5) or the Max Sum Similarity metric proposed by the KeyBERT authors.
KPRank [33] is a graph-based approach, which embeds positional and contextual information into a biased PageRank algorithm. KPRank calculates a document embedding vector, similarly to other approaches, by utilizing a pre-trained language model named SciBERT [34]. Afterwards, this approach builds a term graph where, for each node $v_i$, a theme score is calculated as the cosine similarity between the embedding of $v_i$ and the document embedding. This score is used to assign a higher probability score to a term with a higher theme score. Furthermore, KPRank calculates a position score for each term in the graph, which is the sum of its inverse positional occurrences in the document. For each node $v_i$, a weight $v_{wi}$ is calculated as the product of the aforementioned theme and position scores. All node weights are normalized and stored into a vector p, as described in (8). This vector is used as a positional bias for the biased PageRank formulated in (9).
In Eq. (9), $v_i$, $v_j$ are the nodes of the term graph, $w_{ij}$ is the edge weight, set to the co-occurrence frequency between a pair of terms within a fixed-size window (k = 10), and $|out(v_j)|$ is the out-degree of node $v_j$. Finally, the candidate keyphrases are ranked by their descending $S(v_j)$ scores and the top-n of them are extracted.

III. LMRank: THE PROPOSED APPROACH
As shown in Fig. 2, the approach proposed in this article follows a multi-step methodology, which is similar to the earlier works presented in Section II.

A. LMRank
The first step of our approach (see pseudo-code in Alg. 1) is to extract a list of candidate keyphrases from the text. This is done by extracting noun phrases (NPs), i.e., phrases that contain terms with multiple POS tags, where the final term of the NP is a noun. NPs are built using syntax dependency parsing, which connects terms in a syntactic dependency tree. When traversed, this tree yields the NPs. This is visualized through a sample sentence in Fig. 1. As shown, the NPs are 'float switch', 'relief valve' and 'pressure relief valve'. It is noted that earlier approaches would have found the position of the final noun of a sentence and traced the sentence backwards until they found the first noun or adjective in the sequence. This would erroneously result in a single candidate keyphrase, e.g., 'carbonator float switch and pressure relief valve'. We follow the dependency parsing technique because regular expression patterns ignore common speech patterns (e.g., conjunctions), which leads to the production of long phrases consisting of multiple keyphrases. The removal of conjunctions from keyphrases is also discussed in [35].
Apart from the dependency parsing, we also filter out candidate keyphrases that have certain patterns. The removal of these patterns leads to certain accuracy benefits [35], [36], [37]. Specifically, the removed keyphrases have one of the following patterns: (i) the keyphrase is a stop word [36], [37]; (ii) the keyphrase starts with a pronoun ('he', 'she', 'they') or a particle (e.g., 'not') [35]; (iii) the keyphrase has a character length less than three (e.g., 'a', 'an', 'to') [36]; (iv) the keyphrase starts with a digit (e.g., '90% percent of participants') [36]. In our approach, we also define a boolean flag, called keep_nouns_adj, which, when set to True, filters out noun phrases whose individual terms are not all (proper) nouns or adjectives; this pattern is referred to as the NP chunking technique [37].
For keyphrases that appear often in the text, we maintain their first positional occurrence and remove their exact duplicates from the list to minimize the number of computations.
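The filtering patterns (i)-(iv) and the removal of exact duplicates can be sketched in Python as follows. This is an illustrative sketch only and not the released implementation: the names `STOP_WORDS`, `PRONOUNS_PARTICLES`, `keep_candidate` and `filter_candidates` are hypothetical, the word lists are deliberately tiny, and the input is assumed to be a list of noun phrases in order of appearance in the text.

```python
# Tiny illustrative word lists; a real setup would use the lists of an NLP library.
STOP_WORDS = {"a", "an", "the", "to", "of", "it"}
PRONOUNS_PARTICLES = {"he", "she", "they", "not"}


def keep_candidate(phrase):
    """Return False if the phrase matches one of the removal patterns (i)-(iv)."""
    phrase = phrase.strip()
    terms = phrase.lower().split()
    if not terms:
        return False
    if phrase.lower() in STOP_WORDS:        # (i) the keyphrase is a stop word
        return False
    if terms[0] in PRONOUNS_PARTICLES:      # (ii) starts with a pronoun or particle
        return False
    if len(phrase) < 3:                     # (iii) fewer than three characters
        return False
    if phrase[0].isdigit():                 # (iv) starts with a digit
        return False
    return True


def filter_candidates(phrases):
    """Apply the filters and keep only the first occurrence of each phrase."""
    seen = set()
    kept = []
    for phrase in phrases:
        key = phrase.lower()
        if key not in seen and keep_candidate(phrase):
            seen.add(key)
            kept.append(phrase)
    return kept


noun_phrases = ["float switch", "they", "not valid", "an", "90% of participants",
                "float switch", "relief valve", "the"]
candidates = filter_candidates(noun_phrases)
```

In this example, only 'float switch' and 'relief valve' survive, in their order of first appearance.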

Algorithm 1 Pseudo-Code of LMRank
This step also has an optional feature, called deduplication, which can be toggled by setting the deduplicate flag of our extract_keyphrases() method to True. As mentioned in Section II, due to the similarity score, a lot of embeddings-based approaches extract keyphrases that are redundant (e.g., in a document that discusses machine learning, the extracted keyphrases could be 'machine learning', 'machine learning programs', 'machine learning algorithms' etc.). To eliminate these duplicates, we sort the list lexicographically by using Python's stable sort algorithm; this has the benefit of placing each shorter keyphrase before the longer keyphrases that start with it. We consider these shorter keyphrases to be more generic, since they constitute the largest common substrings. To identify and remove the redundant keyphrases, we use a native Python function called get_close_matches() that can be imported from the difflib module (https://docs.python.org/3/library/difflib.html); in this function, we set a cutoff of 0.65 and the number of nearest matches to 10, meaning that the top-10 most similar keyphrases that share more than 65% of their characters with the largest common substring are deleted from the list of candidate keyphrases.
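A minimal sketch of this deduplication idea is given below. It relies on the standard get_close_matches() function of difflib with the cutoff and number of matches stated above; the function name `deduplicate` and the exact loop structure are assumptions made for illustration and may differ from the released code.

```python
from difflib import get_close_matches


def deduplicate(keyphrases, cutoff=0.65, n=10):
    """Keep the most generic keyphrase of each group of near-duplicates."""
    # Lexicographic sort: a phrase precedes the longer phrases that start with it.
    remaining = sorted(set(keyphrases))
    kept = []
    while remaining:
        head = remaining.pop(0)
        kept.append(head)
        # Up to n phrases whose similarity ratio to `head` is at least `cutoff`.
        close = set(get_close_matches(head, remaining, n=n, cutoff=cutoff))
        remaining = [phrase for phrase in remaining if phrase not in close]
    return kept


phrases = ["machine learning programs", "machine learning",
           "machine learning algorithms", "dependency parsing"]
unique = deduplicate(phrases)
```

Here, 'machine learning programs' and 'machine learning algorithms' are dropped in favor of the more generic 'machine learning', while 'dependency parsing' is unaffected.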
The second step of our approach is to embed the candidate keyphrases and the document into a common embedding space. To this end, we use the sentence transformer model discussed in Section III-B, which has a max sequence length of 384. The candidate keyphrases, which consist of a few words (typically between two and five), are much shorter in length; therefore, we directly embed these in a single pass. For the document embedding, we retrieve its token length from the English NLP model of spaCy. If the text has a token length shorter than the max sequence length, we embed the entire document at once; if the text is longer, we split the document in chunks of 384 tokens and we pass these chunks to the sentence transformers model. The document embedding is the mean embedding of all chunk embeddings.
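The document embedding procedure described above can be sketched as follows. To keep the example self-contained, the encoder is passed in as a generic callable; `embed_document`, `toy_encode` and the two-dimensional toy vectors are hypothetical names and values, and a real setup would pass the encoding function of the chosen sentence transformer model instead.

```python
def embed_document(tokens, encode, max_seq_length=384):
    """Mean of the chunk embeddings; `encode` maps a string to a list of floats."""
    if len(tokens) <= max_seq_length:
        chunks = [tokens]                      # short text: embed it at once
    else:
        chunks = [tokens[i:i + max_seq_length]
                  for i in range(0, len(tokens), max_seq_length)]
    vectors = [encode(" ".join(chunk)) for chunk in chunks]
    dim = len(vectors[0])
    # Component-wise mean over all chunk embeddings.
    return [sum(vec[k] for vec in vectors) / len(vectors) for k in range(dim)]


def toy_encode(text):
    """Stand-in encoder for illustration: (number of words, constant 1.0)."""
    return [float(len(text.split())), 1.0]


tokens = ["term"] * 800                        # splits into chunks of 384, 384 and 32
doc_vec = embed_document(tokens, toy_encode)
```

Note that the final chunk is usually shorter than the others, yet it contributes to the mean with the same weight; this follows directly from taking an unweighted mean of chunk embeddings.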
We refer to the above technique as ''text chunking'', with an alternative being to split the text into sentences and then encode their embeddings and take their mean embedding for the document representation. During our experiments, we noticed that this alternative technique was slower in terms of execution time of the embedding encoding process. However, the text chunking technique relies heavily on the fact that the writing system of a language uses whitespace characters as word separators, which for certain languages (e.g., Chinese, Japanese, Korean, Thai, Khmer) is not true. In the case of such languages, we would need to use the sentence splitting technique before calculating the embeddings. If we do not use one of these techniques, then any employed transformer model would ignore all text beyond its input size limitation due to the attention mechanism [38], and our approach would miss valuable context in its document embedding representation.
VOLUME 11, 2023. Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
The third and final step of our approach is the candidate keyphrase ranking. This is done by calculating the cosine similarity score between the document representation and the candidate keyphrases. We start by normalizing the embedding vector of the candidate keyphrase; then, we store these normalized vectors in an inner product index. This is equivalent to calculating the cosine similarity score between the queried vector and the vectors stored in the index. Afterwards, we normalize the document vector and query the index to find the top-k most similar vectors and their similarity scores, where top-k is equal to the length k of the list of the candidate keyphrases, since we rank the entire list by the descending similarity score. The reason we do not instantly extract the top-n keyphrases is that we may optionally employ an additional post-processing step, which multiplies the similarity score by the SoftMax value of the inverse first position score, as shown in Eq. (7). For the position metric adopted in our approach, we set the hyperparameter µ to 1. The reason that we incorporated this metric is the increase in accuracy scores, as argued by the authors of SIFRank and KPRank. This post-processing step can be set using the positional_feature flag of the extract_keyphrases() method. In any case (independently of whether we employ the post-processing step or not), the top-n keyphrases are extracted from the final list of keyphrases.
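The ranking step can be illustrated with the following pure-Python sketch, which computes the same normalized inner products that an inner product index would return. The optional position weight is written as a softmax over 1 / (p + µ), where p is the first position of a keyphrase; this form is an assumption modeled on the description of SIFRank+ and should be checked against Eq. (7). The names `normalize` and `rank_keyphrases` and the toy vectors are hypothetical.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def rank_keyphrases(doc_vec, candidates, first_positions=None, mu=1.0, top_n=10):
    """candidates: {phrase: vector}; first_positions: {phrase: first position (1-based)}."""
    doc = normalize(doc_vec)
    # The inner product of unit vectors equals their cosine similarity.
    scores = {
        phrase: sum(a * b for a, b in zip(normalize(vec), doc))
        for phrase, vec in candidates.items()
    }
    if first_positions is not None:
        # Assumed position weight: softmax over the inverse first positions.
        exps = {p: math.exp(1.0 / (first_positions[p] + mu)) for p in scores}
        total = sum(exps.values())
        scores = {p: s * exps[p] / total for p, s in scores.items()}
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]


# Toy embeddings and positions (hypothetical values, for illustration only).
doc_vec = [1.0, 0.0]
candidates = {"relief valve": [1.0, 0.1], "float switch": [0.9, 0.3], "carbonator": [0.1, 1.0]}
by_similarity = rank_keyphrases(doc_vec, candidates)
with_position = rank_keyphrases(
    doc_vec, candidates,
    first_positions={"relief valve": 40, "float switch": 1, "carbonator": 10},
)
```

In this toy case, 'relief valve' ranks first by similarity alone, while enabling the position weight promotes 'float switch', which appears at the very start of the text.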

B. SOFTWARE SPECIFICATIONS
For the candidate keyphrase extraction step, we utilize the spaCy [39] library and, in particular, its pretrained natural language model named en_core_web_sm. Specifically, we extract candidates using spaCy's noun chunks, which are essentially NPs. Generally speaking, spaCy is considered a very prominent NLP framework in terms of execution speed [40]. However, this speed comes at the cost of not achieving state-of-the-art accuracy in most NLP tasks. In any case, we make clear that our approach does not depend on spaCy; other NLP libraries can be used instead of it (e.g., Stanza [40]).
For the embeddings of the candidate keyphrases and the document, we utilize the sentence transformers approach and the all-mpnet-base-v2 model, which is currently the most accurate English language model regarding the semantic similarity task [32]. This model embeds each input text sequence into a 768-dimensional embedding vector. Currently, our approach supports only the English language; however, through the use of multilingual models, we can expand our approach to more natural languages. We selected the aforementioned approach for sentence embeddings because this library supports multithreading and multiprocessing across multiple devices (CPU threads, multiple GPUs etc.), which, compared to earlier transformer approaches, reduces the execution time dramatically without sacrificing the accuracy of the semantic similarity task.
For the candidate keyphrase ranking process, we use the FAISS vector search library [41] (https://faiss.ai/) of the Facebook AI Research (FAIR) lab, which enables a fast and efficient comparison of multiple vectors using a low-level C API. As reported in [41], FAISS achieves state-of-the-art speedup in terms of execution time thanks to CPU/GPU parallelisms. FAISS has been benchmarked in terms of performance and accuracy vs. other similar libraries, and it is reported to achieve good accuracy and performance [42]. Compared to earlier embedding-based approaches, the use of FAISS in our approach saves computational time in the candidate keyphrase ranking step. In this step, the cosine similarity is calculated between the embeddings of the candidate keyphrases and the document embedding, while the list of candidate keyphrases is ranked by the descending similarity score.

IV. EVALUATION
This section reports on the evaluation of our approach against similar works described in Section II.

A. TECHNICAL SPECIFICATIONS
As far as hardware specifications are concerned, we utilized a PC workstation with an Intel Core i9 CPU with 20 logical cores and maximum clock speed of 5 GHz, and two dual inline memory modules, each having 32 GB RAM. With respect to software specifications, we used a native 64-bit Windows 10 OS. It is noted that, since some of the evaluated approaches require Python versions that are older than 3.8, as well as older packages that are no longer supported, we ran the evaluation in an Ubuntu Mate 64-bit Linux Virtual Machine, which accesses 32 GB of system memory, as well as 16 out of 20 logical cores. For the selected unsupervised classical and embeddings-based approaches (see Table 1), we utilized software implementations provided by their original authors, whenever possible. For the approaches for which no official software implementation is mentioned in their original papers, we used third-party libraries such as pke [43], which implements many approaches found in the literature.

B. DATASETS AND METRICS
With respect to datasets, we ran experiments using eight English datasets. We categorize these datasets based on their textual length (i.e., short, medium and long length). For datasets with short texts (i.e., paper abstracts with approximately 50-200 terms), we opted for Inspec [44], WWW and KDD [45]; we also opted for Semeval-2017 [46], which contains paragraphs from scientific journals and is similar to the former ones in terms of length. For datasets with medium length texts (i.e., news articles with approximately 300-850 tokens), we selected DUC-2001 [2] and 500N-KP-Crowd [47]. Finally, for datasets with long texts (i.e., full text academic papers with approximately 1000-9000 tokens), we employed Semeval-2010 [48] and NUS [49]. As described in [50], the rationale behind evaluating approaches on multiple datasets of different textual lengths and thematic domains is that the different characteristics of the datasets influence the accuracy of KE approaches.
In our evaluation, we used the F1 and partial F1 (pF1) metrics, which are explained below. The reason we also use pF1 is the main limitation of the exact matching technique used by the F1 score, where highly similar strings (e.g., 'machine learning approaches' and 'machine learning methods') are not considered as a match, despite their high similarity. Recent publications that use the pF1 metric include [9], [51] and [52]. In our work, F1 is computed as the harmonic mean of the exact match Precision and Recall metrics:
$$\mathrm{F1} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{10}$$
where:
$$\mathrm{Precision} = \frac{\text{number of exactly matched keyphrases}}{\text{total number of extracted keyphrases}} \tag{11}$$
$$\mathrm{Recall} = \frac{\text{number of exactly matched keyphrases}}{\text{total number of assigned keyphrases}} \tag{12}$$
Additionally, pF1 is computed as the harmonic mean of the partial match Precision and Recall metrics:
$$\mathrm{pF1} = \frac{2 \cdot \mathrm{pPrecision} \cdot \mathrm{pRecall}}{\mathrm{pPrecision} + \mathrm{pRecall}} \tag{13}$$
$$\mathrm{pPrecision} = \frac{\text{number of partially matched keyphrases}}{\text{total number of extracted keyphrases}} \tag{14}$$
$$\mathrm{pRecall} = \frac{\text{number of partially matched keyphrases}}{\text{total number of assigned keyphrases}} \tag{15}$$
The number of exactly matched keyphrases is measured as the number of exactly matched strings between the human-assigned keyphrases and the ones extracted by the automated approaches. The number of partially matched keyphrases is measured as the number of extracted keyphrases that share one or more common terms with the human-assigned keyphrases. To avoid matching keyphrases that are not highly similar, we match a pair of keyphrases from the two sets only when they share the maximum number of common terms. Another important implementation detail, which avoids double counting, is that if an extracted keyphrase matches a human-assigned keyphrase that has already been matched, the match count is not increased.
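The exact and partial matching rules described above can be illustrated with a minimal Python sketch. It is not the evaluation code used in this work: the function names are hypothetical, and lowercasing with whitespace tokenization (without stemming) is an assumption made only for illustration.

```python
def f1_from_counts(matched, n_extracted, n_assigned):
    """Harmonic mean of precision and recall computed from raw counts."""
    precision = matched / n_extracted if n_extracted else 0.0
    recall = matched / n_assigned if n_assigned else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def exact_matches(extracted, assigned):
    """Count distinct extracted keyphrases that equal an assigned keyphrase."""
    assigned_set = {a.lower() for a in assigned}
    distinct = dict.fromkeys(e.lower() for e in extracted)
    return sum(1 for e in distinct if e in assigned_set)


def partial_matches(extracted, assigned):
    """Pair each extracted keyphrase with the assigned keyphrase sharing the
    most terms; an assigned keyphrase that is matched again is not recounted."""
    used = set()
    count = 0
    for e in extracted:
        e_terms = set(e.lower().split())
        best_idx, best_overlap = None, 0
        for i, a in enumerate(assigned):
            overlap = len(e_terms & set(a.lower().split()))
            if overlap > best_overlap:
                best_idx, best_overlap = i, overlap
        if best_idx is not None and best_idx not in used:
            used.add(best_idx)
            count += 1
    return count


# Toy example (illustrative data only).
extracted = ["machine learning approaches", "neural networks", "data"]
assigned = ["machine learning methods", "neural networks"]

f1 = f1_from_counts(exact_matches(extracted, assigned),
                    len(extracted), len(assigned))
pf1 = f1_from_counts(partial_matches(extracted, assigned),
                     len(extracted), len(assigned))
```

In this toy example only 'neural networks' is an exact match, whereas 'machine learning approaches' additionally counts as a partial match of 'machine learning methods', so pF1 is higher than F1.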

C. EXPERIMENTAL SETUP AND EVALUATION RESULTS
Regarding the selected approaches, we configured them with the parameters recommended by their original authors to produce the top-10 keyphrases. The only exception is KeyBERT, where we used the 'all-mpnet-base-v2' model, selected the MMR metric for deduplication with the diversity parameter set to 0.5, and set the n-gram range to (1,4). For our approach, we used multiple configurations to determine the best setup. Firstly, we ran the approach with deduplication enabled, while disabling the boolean flag keep_nouns_adj (see Section III-A) and the position metric (LMRank deduplicate). Secondly, we ran the approach with deduplication enabled and by keeping keyphrases consisting only of nouns and adjectives, while the position metric was disabled (LMRank deduplicate_nouns_adjs). Thirdly, we ran the approach with deduplication and the position metric disabled, while keeping keyphrases consisting only of nouns and adjectives (LMRank nouns_adjs). Finally, we ran the previous combination with the position metric enabled (LMRank nouns_adjs_pos).
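To illustrate what the diversity parameter of the MMR (Maximal Marginal Relevance) metric controls, the following is a generic sketch of MMR-based selection over embedding vectors. It is not KeyBERT's code: the function names and the toy two-dimensional vectors are hypothetical, and the scoring rule, which weighs relevance to the document by (1 - diversity) and redundancy with already selected candidates by diversity, is one common formulation that is assumed here.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def mmr(doc_vec, cand_vecs, top_n=10, diversity=0.5):
    """Return the indices of the candidates selected by MMR, in order."""
    if len(cand_vecs) == 0:
        return []
    relevance = [cosine(v, doc_vec) for v in cand_vecs]
    # Start with the candidate that is most similar to the document.
    selected = [int(np.argmax(relevance))]
    remaining = [i for i in range(len(cand_vecs)) if i != selected[0]]

    def score(i):
        # Penalize similarity to the closest already selected candidate.
        redundancy = max(cosine(cand_vecs[i], cand_vecs[j]) for j in selected)
        return (1 - diversity) * relevance[i] - diversity * redundancy

    while remaining and len(selected) < top_n:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy vectors: candidates 0 and 1 are near-duplicates, candidate 2 differs.
doc_vec = np.array([1.0, 0.0])
cands = np.array([[1.0, 0.1], [1.0, 0.12], [0.6, -0.8]])

diverse = mmr(doc_vec, cands, top_n=2, diversity=0.5)
greedy = mmr(doc_vec, cands, top_n=2, diversity=0.0)
```

With the diversity parameter set to 0.5, the near-duplicate candidate is skipped in favor of the less similar one; with the parameter set to 0, the selection reduces to ranking by relevance alone.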
All approaches were evaluated on the datasets mentioned in Section IV-B with the macro F1@n (exact match F1 score at top-n keyphrases) and macro pF1@n (partial match F1 score at top-n keyphrases), where n was set to 5 and 10 keyphrases. The values of these scores are in the range [0, 1], with higher values denoting more top-n keyphrases extracted correctly. Tables 2-5 summarize our evaluation results. For each dataset, we mark in bold the best performing classical approach, as well as the best performing setup of LMRank. For the embeddings-based approaches, we mark the best performing approach in bold and we underline the second-best performing approach in terms of F1 and pF1 scores.
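As a brief illustration of the macro F1@n score, the sketch below truncates each ranked list of extracted keyphrases to its top-n entries, computes the exact match F1 per document, and averages over documents. The function names and the toy data are hypothetical, and the sketch assumes that keyphrases have already been normalized.

```python
def f1_at_n(extracted_ranked, assigned, n):
    """Exact match F1 over the top-n extracted keyphrases of one document."""
    top_n = extracted_ranked[:n]
    gold = set(assigned)
    matched = len(set(top_n) & gold)
    if matched == 0:
        return 0.0
    precision = matched / len(top_n)
    recall = matched / len(gold)
    return 2 * precision * recall / (precision + recall)


def macro_f1_at_n(documents, n):
    """Unweighted mean of the per-document F1@n scores."""
    scores = [f1_at_n(extracted, assigned, n)
              for extracted, assigned in documents]
    return sum(scores) / len(scores) if scores else 0.0


# Toy corpus: (ranked extracted keyphrases, human-assigned keyphrases).
documents = [
    (["a", "b", "c"], ["a", "d"]),
    (["x", "y"], ["y"]),
]
macro = macro_f1_at_n(documents, n=2)
```

Macro averaging gives every document the same weight, regardless of how many keyphrases were assigned to it.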
As shown in Tables 2 and 3, MPRank and KP-Miner achieve the best performance among the classical approaches, with the exception of the WWW dataset, where YAKE! performs best. Both MPRank and KP-Miner achieve competitive performance when compared to newer embeddings-based approaches. The best setup for the proposed approach is LMRank nouns_adjs. When comparing the results with the position metric enabled, we see no clear benefit in terms of performance. This contradicts the results presented in SIFRank and KPRank, which suggest that the incorporation of this metric yields better performance in datasets with longer texts. Regarding the embeddings-based approaches, SIFRank+ achieves the best performance in half of the datasets; in the other half, the best performance is achieved by Key2Vec for 500N-KP-Crowd, and by LMRank nouns_adjs for the KDD, NUS and WWW datasets.
As shown in Tables 4 and 5 (for the pF1 metric), the best performing classical approach is MPRank in half of the datasets. In the other half, the best performing methods are YAKE! and KP-Miner. The best setup for the proposed approach is LMRank deduplicate. For the embeddings-based approaches, SIFRank achieves the best performance in the DUC-2001 and NUS datasets, while maintaining the second-best performance in the other ones. EmbedRank achieves the best performance in the 500N-KP-Crowd and SemEval2010 datasets. KeyBERT achieves the best performance in the KDD and WWW datasets. Finally, LMRank deduplicate achieves the best performance in the Inspec and SemEval2017 datasets, while demonstrating near state-of-the-art accuracy for all other datasets.
Overall, LMRank achieves the top F1@5 and top F1@10 scores, surpassing the state-of-the-art embeddings-based approach (SIFRank), in the KDD, NUS and WWW datasets. As far as the Inspec dataset is concerned, LMRank achieves a pF1@5 score of 74.8%, which is an absolute gain of 4.2 percentage points over SIFRank. Similarly, it outperforms SIFRank in terms of pF1@5 score in the SemEval2017 and WWW datasets. Regarding the pF1@10 score, LMRank exhibits relative performance gains over SIFRank similar to those observed for the pF1@5 score.
To further highlight the strengths of LMRank over the state-of-the-art approach, we have also selected a representative document from the Inspec dataset for qualitative analysis, which is shown in Table 6.
As shown in Table 6, SIFRank finds fewer human-assigned keyphrases (highlighted in blue), extracts longer keyphrases that do not strictly consist of nouns and adjectives (e.g., "paper concerns biorthogonal nonuniform filter banks"), and repeats other thematically similar keyphrases, thus reducing its coverage. We argue that the novel aspects of our approach lie in the usage of dependency parsing and deduplication, which are described in Section III-A. These techniques enable the extraction of (i) better candidate keyphrases (by avoiding extremely long phrase patterns), and (ii) thematically distinct keyphrases.

D. PERFORMANCE BENCHMARK ANALYSIS
In this section, we benchmark the performance of LMRank against approaches with similar F1 / pF1 scores, namely SIFRank+, EmbedRank+ and KeyBERT. For the performance analysis, we consider the metric of mean elapsed system time (in seconds). We start the time measurement at the point where the KE method of each approach is executed; we do not measure the "warmup" time of each method, during which the underlying ML models are loaded into memory. For the execution environment, we use the Linux VM described in Section IV-A. For the sample text, we used the Wikipedia entry for the topic of 'Machine Learning' (https://en.wikipedia.org/wiki/Machine_learning), which consists of 8020 terms. For each run, we chunked the text into segments whose sizes (in terms) are taken from the set {250, 500, 1000, 2500, 5000, 6500, 8020}. We measure the execution time of each run (in seconds) and repeat this process across 20 iterations. We then calculate the average of these measurements to produce the mean system time of each KE approach.
A comparative overview of the performance benchmark appears in Fig. 3. As shown, LMRank significantly outperforms all other approaches by orders of magnitude, with the exception of EmbedRank+. This is expected, since EmbedRank+ does not utilize a transformer model, whose computational complexity is quadratic in the input size (contrary to earlier embedding models that used static vector representations). This design choice for EmbedRank+, while offering faster execution, sacrifices accuracy significantly. Instead, LMRank utilizes the sentence transformers package for the embeddings calculation step, similarly to KeyBERT; however, according to the results, the runtime of KeyBERT grows exponentially with the input size. SIFRank+ is slower than both LMRank and EmbedRank+, while suffering from runtime slowdowns at 5000 and 6500 tokens.
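The measurement procedure described in this section can be sketched as a small timing harness. This is an illustrative sketch and not the benchmark script used in this work: the function names and the dummy extractor are hypothetical, and wall-clock time measured with time.perf_counter is assumed as a stand-in for the elapsed system time. Model loading is expected to happen before the harness is called, so that the warmup time is excluded.

```python
import time
from statistics import mean


def benchmark(extract_fn, terms, sizes, iterations=20):
    """Return the mean elapsed time (in seconds) of extract_fn per input size."""
    results = {}
    for size in sizes:
        # Build an input consisting of the first `size` terms of the text.
        text = " ".join(terms[:size])
        elapsed = []
        for _ in range(iterations):
            start = time.perf_counter()
            extract_fn(text)
            elapsed.append(time.perf_counter() - start)
        results[size] = mean(elapsed)
    return results


def dummy_extract(text):
    """Placeholder for the KE method of an already loaded approach."""
    return text.split()[:10]


sizes = [250, 500, 1000, 2500, 5000, 6500, 8020]
terms = ["term"] * 8020  # stand-in for the tokenized sample text
timings = benchmark(dummy_extract, terms, sizes, iterations=3)
```

Replacing dummy_extract with the extraction call of each evaluated approach would yield one timing curve per approach over the same input sizes.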

V. CONCLUSION
This article has proposed a novel approach, namely LMRank, which uses syntactic dependency parsing to extract keyphrases and leverages a highly accurate state-of-the-art sentence embeddings model to capture the semantic similarity between the candidate keyphrases and the document, the ultimate aim being to improve the keyphrase ranking process. Our approach also incorporates a SoftMax-based position metric, which ranks keyphrases that appear near the beginning of the document higher. We have conducted an extensive evaluation on eight heterogeneous datasets and showcased the robustness of LMRank when documents of different lengths and topics are considered; it has been shown that LMRank not only matches the state-of-the-art approaches in terms of accuracy, but also outperforms them significantly in selected datasets. We have also showcased the scalability of LMRank for medium and large documents, where it significantly outperforms all other approaches, with the exception of EmbedRank+.
With respect to future work directions, we plan to add multilingual support to LMRank through the use of a multilingual transformer model and the associated spaCy NLP models. We also consider the use of embeddings of large language models, such as GPT-3 [53], aiming to capture even more context and thus further improve the accuracy of the proposed approach.