Unstructured Text Documents Summarization With Multi-Stage Clustering

In natural language processing, text summarization is an important application that extracts the desired information by condensing large text. Existing studies use keyword-based algorithms for grouping text, which do not capture the documents' actual theme. Our proposed dynamic corpus creation mechanism combines metadata with summarized extracted text. The proposed approach analyzes the mesh of multiple unstructured documents and generates a linked set of multiple weighted nodes by applying multistage clustering. We have generated adjacency graphs to link the clusters of various collections of documents. This approach comprises ten steps: pre-processing, making multiple corpuses, first stage clustering, creating sub-corpuses, interlinking sub-corpuses, creating a page rank keyword dictionary of each sub-corpus, second stage clustering, path creation among clusters of sub-corpuses, and text processing by forward and by backward propagation for results generation. The outcome of this technique consists of sub-corpuses interlinked through clusters. We have applied our approach to a News dataset; this interlinked corpus processing follows step-by-step clustering to search the most relevant parts of the corpus with less cost and time and with improved content detection. We have applied six different metadata processing combinations over multiple text queries to compare results during our experimentation. The comparison results of text satisfaction show that Page-Rank keywords give 38% related text, single-stage clustering gives 46%, two-stage clustering gives 54%, and the proposed technique gives 67% related text. Furthermore, this approach searches the relevant data in order from the most to the least relevant content. It provides a systematic query-relevant corpus processing mechanism, which automatically selects the most relevant sub-corpus through dynamic path selection. We used the SHAP model to evaluate the proposed technique, and our evaluation results showed that the proposed mechanism improved text processing. Moreover, the combined text summarization features showed satisfactory results compared to the summaries generated by general models of abstractive & extractive summarization.


I. INTRODUCTION
A large number of unstructured text documents exist for use in daily life, and it is not easy to process them without an automatic approach. Automatic corpus processing approaches rely on dynamic information grouping in text retrieval. Efficient grouping of text documents reduces the size of the corpus that must be processed. However, grouping text documents by splitting multiple text documents into related subsets is a difficult text processing task [1], [3], [7]. Furthermore, the efficiency of information retrieval approaches decreases in the absence of such grouping of the corpus [2], [3]. Less efficient grouping ultimately increases the cost and effort involved in the information retrieval process. Text mining experts determine the processing cost & time by the size and information diversity of the underlying corpus [4], [5]. Text processing efficiency improves when a small, related corpus is processed in less time. Therefore, the optimal use of time and cost for corpus processing is crucial in an information retrieval system. To resolve this issue, metadata analysis and relevant text extraction have a significant role in optimizing the processing effort, cost, and time [6], [8], [17].
The associate editor coordinating the review of this manuscript and approving it for publication was Muhammad Afzal.
The text summarization processes rely on text classification by machine learning to perform text mining. This extraction process attempts to extract important sentences or those sentences carrying salient words. In the text extraction process, we refer to these sentences as key-sentences [2], [4], [8], [59], [61]. The reason for selecting abstractive & extractive summarization is to overcome differences in sentence structure, active or passive voice, etc. These forms of unstructured-text summarization give language-independence advantages in text-query processing [1], [3], [9], [58], [60].
The automatic summarization of unstructured text has various challenges and problems. The basic one concerns its robustness & completeness in representing the actual text with fewer words/lines. It is challenging to guarantee that a short extracted text conveys all the information of the underlying text. In the literature, researchers study this problem of text representation through two strategies: abstraction & extraction. The abstractive, or metadata processing, technique gives a keyword-based theme of the text as an abstractive summarization [13], [24]. Collective metadata of similar documents leads to a relevant set of documents more effectively than processing documents with individual metadata. The process of keyword-based clustering generates groups of the metadata in the form of a dictionary [8], [9]. This process faces challenges such as semantic restrictions and language dependency. Still, it has wide application to documents that contain relatively short unstructured text. It is well suited to tasks such as headlines, document titles, website or research paper keywords [5], [21], sentence compression [13], [19], [24], and sentence fusion [11], [17]. Keyword selection has many concerns and constraints, and as per the literature, one cannot rely on it as a stand-alone, comprehensive text representation.
After the metadata processing, extractive summarization reduces the corpus by removing less necessary text [3], [10]. The potential sentences are selected either through the sentence score algorithm or through word embedding principles, i.e., the Word2Vec model [11], [18], [29]. Frequency-based weighting techniques were the preliminary research in sentence extraction studies [9], [14], [31]. Similarly, semantic assessment by latent analysis [19], [27], [39], Markov models, and graph-oriented supervised and unsupervised techniques are under investigation [4], [16], [27]. In short, extractive summarization is a key area of research as per the current literature analysis [2], [18], [29], [37]. To simplify extractive summarization, the sentence selection process considers summary-worthiness: the sentence ranking process extracts the most salient 'n' sentences. These processes essentially relate to the classification problems of machine learning. In the literature, researchers mostly apply either syntactic or semantic approaches for deciding sentence scores. The syntactic approach implements predetermined, usually hand-crafted features for the sentences under consideration for scoring [8], [13], [22], [39]. Semantic approaches involve the words' meanings and various types of text phrases to create a semantic relation [9], [17], [28], [41]. However, during summarization, both the text's meanings and their semantic relations in written sentences rely on the text's structural properties.
As per our literature review, there is room to comprehensively combine abstractive & extractive summarization to handle text representation, and these techniques need attention. Few studies target syntactic & semantic issues in a combined manner. This combination can address both the document-dependent & document-independent issues of unstructured text summarization. Although the presented research has shown the capacity to identify relevant text, these studies pay little attention to minimizing corpus diversity by creating a relevant, reduced corpus at runtime through multistage clustering & summarization. These studies rely only on sentences and overlook the various syntactic text that exists in unstructured text. Therefore, these techniques are not mature or final solutions for handling diverse information in an unstructured text corpus.
The proposed technique is a new way of text processing that gives a path-oriented visualization of the actual large corpus. The first objective of the current study is to systematically create an improved corpus visualization by combining clustering, abstractive, and extractive summarization. This form of corpus visualization uses path-oriented query processing to handle the diverse nature of any extended text corpus. A further objective of this study is to perform improved metadata analysis by relying on pre-extracted text. This study binds the keyword metadata with the reduced extracted text, and it achieves the goal of extracting crucial parts of the actual text. Another objective of this technique is to develop a mechanism that combines the features of single & multiple document summarization. We have achieved all these objectives by interlinking single document summarization metadata through cluster-oriented inter-linking & summarization.
This research article consists of eight parts. Section-I describes the main aspects of the paper and contains a complete description of the current research paper. Section-II briefly reviews the existing related work and compares the associated studies. Section-III provides an overview of the techniques applied in the current study. The next two sections cover the methodology and results: Section-IV describes the implemented steps of the proposed technique, and Section-V presents the experimental results as graphs and tables with brief descriptions. Section-VI is the summary of our presented work. We have given the detailed summary of the entire work in Section-VII. Section-VIII is the conclusion of this article.

II. RELATED WORK
Several text processing systems exist to process unstructured textual documents from multiple unstructured sources of free text. The existing information extraction methods use dictionary-based approaches, pattern matching approaches, and rule-based systems [14], [15]. The details of the related work are as follows.
Liu, K., & El-Gohary, N. (2017) applied abstractive summarization over maintenance action reports of bridge conditions and processed them by applying automatic metadata analysis. Abstractive extraction served as a tool for bridge deterioration prediction related to decision making about bridge maintenance. Their work presented a semi-supervised information extraction procedure for decision-making information mining. The researchers generated keyword dictionaries to reveal current deficits and maintenance actions from the inspection reports of the bridges. Their approach defined dependency structures by abstractive summarization for both labeled and unlabeled data.
Ying, Y. et al. (2017) implemented abstractive text summarization for key phrase extraction. They applied a graph-based technique for key phrase ranking during text extraction. They selected an abstractive text extraction approach to create connections between the keywords and to ignore the impact of sentence structure. They proposed certain types of metadata relationships between standard terms and their related sentences. First, they grouped multiple documents as a set of clusters, and these clusters presented the main topics of the given papers. They developed three different metrics and showed that their work performed better than extracting the key phrases alone.
Gupta, A. et al. (2018) analyzed radiology reports by abstractive summarization and clustering to overcome the previous difficulties of manual text processing. They improved dictionary-based and rule-based approaches with better keyword extraction and text grouping. In earlier processing techniques, the named entity recognition process missed the cluster relationship analysis. They extracted named-entity relations by an unsupervised approach, without prior knowledge. They covered text processing dependencies by parse trees and related them in the form of distributed semantics.
VOLUME 8, 2020
Moradi, M. (2018) used extractive text summarization and reduced the text by eliminating less essential parts. This technique applied text summarization over biomedical data and was named the Clustering & Itemset-mining Biomedical Summarizer (CIBS). This approach processed the text input to extract biomedical concepts. The concepts represented the main topics by applying the itemset mining algorithm over the reduced text, and CIBS placed sentences into the relevant set of clusters.
Azadani, M. N., et al. (2018) worked on abstractive & extractive text summarization. They developed a solution for scientific and clinical literature summaries to maintain valuable, informative content. They applied graph-oriented summarization in domain-specific knowledge mining for frequent itemset identification. This summarizer implemented the Unified Medical Language System and provided a concept-based text mining model. It processed source documents by mapping concepts to actual records. It provided interrelated itemsets with a correlation similarity function to generate graphs.
Uçkan et al. (2020) proposed a text summarization technique named KUSH. This technique identifies the maximum possible sets of non-overlapping abstractive summarizations. They marked these sets as nodes to determine the context of various paragraphs in unstructured text documents. They focused their work on generating coherent text visualization. Their proposed KUSH technique used abstractive summarization, integrated with set theory and graph visualization methods.
Sanchez-Gomez et al. (2020) developed a multi-objective abstractive & extractive summarization-based context analyzer. Context defines critical objectives and operational usability. For the functional usability analysis, the keyword-based approach identified the main parts of context in the paragraphs. The summarizer utilizes pre-defined criteria for text selection. This work focuses on the rising demand for automatic text summarization methods.
Deng et al. (2020) pointed out the problems of unknown words and incomplete sentences. They presented an abstractive text summarization approach as an alternative to the sequence-to-sequence text summarization models, which integrated text summarization with sequence-to-sequence models in Chinese text assessment. They included adversarial learning in their proposed text summarizer. Their comparison results showed that the proposed method improved text assessment with the addition of abstractive text summarization.
Mohd et al. (2020) tried to capture & preserve the text's semantics as the fundamental feature for summarizing a document. To generate high-quality summaries, the researchers applied the distributional semantic model with abstractive & extractive summarization. These summaries proved suitable under ROUGE evaluation over the DUC-2007 dataset. The main contribution of this article is to include summarized semantics as part of the text assessment features.
Bidoki et al. (2020) worked on a multi-document text summarizer and developed an extractive summarization-based semantic framework. This semantic framework combines machine learning & graphs with text summarization and extracts sentences representing semantics from a set of text documents with the word2vec model. It identifies meaningful sentences by applying graph relations to the documents. This method helped to identify topics relevant to the text documents.
Mutlu et al. (2020) presented a new dataset for abstractive and extractive summarization tasks. This dataset consists of miscellaneous academic publications with the abstracts and the human-extracted text from these papers. This method combines the keywords of these three parts and joins them with the critical extracted sentences. This form of academic publication presentation helps to assess the validity of the text extraction process. This method reinvestigates the robustness of the extractive summarization process over their dataset. This study focused on the semantic features by using GloVe & word2vec embeddings.
Table 1 contains the summary and comparison of the features of the existing studies. Most of these approaches involve significant text processing and ultimately consume more time. These approaches pay less attention to creating a strong relationship between the metadata and the actual text with some pre-extracted text. These techniques handle the metadata quality to overcome the corpus's diversity and obtain relevant text query results. Therefore, these studies relate to our research regarding enhanced corpus processing visualization. Table 1 presents the essential work and limitations of the described studies, and it shows the salient features of our proposed technique [1]-[11].

III. TEXT PROCESSING TECHNIQUES OVERVIEW
Text processing relies on multiple approaches to identify exactly the required text. These approaches differ from the data processing applied in spreadsheets and databases [6], [17]. As the initial step, text processing techniques analyze the textual data using morphological analysis and wordless analysis [8], [13], [21]. Next, these techniques classify the text into groups ranging from useful to less useful [16], [17], [21]. This section contains a description of the text processing techniques.

A. TEXT PRE-PROCESSING
Preliminary text analysis is the first step in text processing and comprises information search, formation, and extraction by text mining methods [11], [16], [28]. Data cleansing removes elements like special characters, punctuation, and tags. All such items are useless for the underlying text processing and only add unnecessary noise [5], [8], [21]. However, there exists no fixed rule to categorically declare what counts as noise in any type of textual data. Removing anything from the source text depends on the problem statement [2], [14], [19]. Generally, the steps that follow tokenization are stop-word removal, stemming, and lemmatization [4], [13], [22].
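The cleansing and tokenization steps described above can be illustrated with a minimal sketch. It uses only the Python standard library; the stop-word list and the suffix-stripping function `naive_stem` are hypothetical stand-ins for a full stop-word list and a real stemmer or lemmatizer, and are not the procedure used in this study.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny stop-word list (assumption for illustration).
STOP_WORDS = {"a", "an", "and", "are", "is", "of", "the", "to", "in", "on", "for", "with"}

def strip_tags_and_noise(text):
    """Remove markup tags, fold accented characters to ASCII, drop punctuation."""
    text = re.sub(r"<[^>]+>", " ", text)
    text = unicodedata.normalize("NFKD", text)
    text = text.encode("ascii", "ignore").decode("ascii")
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)
    return text.lower()

def naive_stem(token):
    """Crude suffix stripping, standing in for a real stemmer or lemmatizer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Cleanse, tokenize on whitespace, remove stop words, then stem."""
    tokens = strip_tags_and_noise(text).split()
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]

tokens = preprocess("<p>The clusters are linked, and the café documents were processed!</p>")
print(tokens)
```

Which elements count as noise remains problem-dependent, so the regular expressions here are one possible choice and not a fixed rule.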

B. DICTIONARY-BASED APPROACHES
There are two different approaches to Morphological Analysis (MA), i.e., dictionary-based MA and MA without dictionaries. The vocabulary of the language provides a base for MA. The wording of natural language has many compound words or constructions using punctuation marks like hyphens and apostrophes [13], [25], [41]. The vocabulary-based approach to MA faces two challenges, i.e., the incompleteness of the dictionaries and the constant appearance of new words [21], [22], [49]. In MA methods, wordless analysis handles all those words which are not present in the dictionary. Wordless morphology uses the regularities of inflections and grammatical meanings [21], [37], [44]. The principle of this morphology relies on analyzing the end of a word to predict its morphological features [26], [37]. MA of the English language is based on the use of finite state machines and finite state transducers [23], [24], [31].

C. TEXT RELEVANCE BY TERM FREQUENCY-INVERSE DOCUMENT FREQUENCY (TF-IDF)
The TF-IDF method determines text relevance in search engines. TF is the ratio of the number of occurrences of a certain word to the total number of words in the document. For example, if a document contains 500 words and the word 'cosine' occurs five times, then TF = 5/500 = 0.01. IDF, the inverse document frequency, is the logarithm of the inverse of the fraction of documents in a large collection that contain the word. This collection consists of all the pages that the indexer processes. For example, if the collection contains 10000 documents and the word 'cosine' appears in 100 of them, then IDF = log10 (10000 / 100) = log10 (100) = 2. Thus, popular words that occur in every tenth document or more often have IDF ≤ 1, and words that occur in every hundredth document or less often have IDF ≥ 2. Words that are likely to characterize a topic typically have IDF close to 2 [24], [25], [44]. TF-IDF gives its maximum value if rare words have many occurrences in the document [26], [29]. This indicator is calculated as follows:

$$\mathrm{TF\text{-}IDF} = \mathrm{TF} \times \mathrm{IDF}$$

D. PAGE RANK
The Page Rank algorithm processes a collection of documents linked by hyperlinks, such as web pages. It assigns each of them a numerical value to measure its importance relative to the other documents [11], [17], [33]. This algorithm applies to any set of objects joined by reciprocal links. The Page Rank (PR) value indicates the significance of a page. This value depends on the number and PR values of all pages that contain links to the current page [27], [43]. If each of pages B, C, and D links only to page A, then PR (A) equals the sum of PR (B), PR (C), and PR (D), since all links in this simple method point to A:

$$PR(A) = PR(B) + PR(C) + PR(D)$$
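The worked figures in this section can be checked with a short sketch. It reproduces the TF and IDF arithmetic for the word 'cosine' and the simple Page Rank case in which pages B, C, and D link only to page A. The function names and the starting rank of 0.25 per page are assumptions made for illustration.

```python
import math

def term_frequency(term_count, total_words):
    """TF: occurrences of the term divided by the words in the document."""
    return term_count / total_words

def inverse_document_frequency(total_docs, docs_with_term):
    """IDF: base-10 logarithm of (collection size / documents containing the term)."""
    return math.log10(total_docs / docs_with_term)

tf = term_frequency(5, 500)                    # 'cosine' occurs 5 times in 500 words
idf = inverse_document_frequency(10000, 100)   # 'cosine' appears in 100 of 10000 documents
tf_idf = tf * idf

def simple_page_rank(target, links, ranks):
    """Simplified PR: sum the ranks of all pages that link to the target."""
    return sum(ranks[src] for src, dsts in links.items() if target in dsts)

links = {"B": ["A"], "C": ["A"], "D": ["A"]}   # B, C, and D link only to A
ranks = {"B": 0.25, "C": 0.25, "D": 0.25}      # hypothetical starting ranks
pr_a = simple_page_rank("A", links, ranks)

print(tf, idf, tf_idf, pr_a)
```

The Page Rank function covers only the simple case described in the text, where every linking page has a single outbound link.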

E. WORD EMBEDDINGS AND TEXT SUMMARIZATION
The Word2Vec model, trained on a body of text, maps words into a low-dimensional vector space. It keeps the distance small between words that occur in close relation, so that vector distance reflects closeness in meaning [15], [29], [46]. It requires the training of an artificial neural network to predict context from a vector of words. This mapping assigns close vectors to words that appear in similar contexts [28], [30], [43]. This technique gives clusters of words that occur close to each other, and these words move closer together or farther apart over time. It is a useful method for topic or subject assessment [17], [33], [48]. In our technique, we have used the Python package Gensim-Word2Vec for text summarization.
This extractive text summarization technique detects text lines containing words with high mutual occurrence in the vector space [53]-[55]. Table 3 shows the comparison of sentence score and Gensim-Word2Vec extractive summaries.

F. K-MEANS CLUSTERING
K-Means clustering (KMC) is an iterative process that stabilizes the cluster centroids [29], [39], [43]. The main characteristic of a cluster is its centroid. The algorithm runs until the centroids stabilize, i.e., until they stop changing or change only negligibly [31], [34], [35]. At first, it selects the initial centroids for the documents. It then distributes all documents among the clusters. Each document falls into only one cluster, chosen by the proximity of the document to the cluster centroid. The algorithm recalculates the centroid of each cluster from the new set of documents in that cluster. The cycle of calculations repeats if any centroid has moved. Otherwise, if the centroids have stabilized, the clustering process completes [32], [33], [46]. For a fixed number of clusters and iterations, the algorithm's theoretical running time is O (n), where n is the number of documents in the set. The algorithm is therefore linear in n and uses the values of the TF-IDF matrix. This method does not need training, and if necessary, it can accumulate information to further increase the accuracy of its work using Bayesian estimates of the clustering parameters [36], [50].
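The assign-and-recompute cycle described above can be sketched in a few lines of standard-library Python. The sketch assumes Euclidean distance as the proximity metric and takes the initial centroids as an argument. The two-dimensional toy vectors are hypothetical stand-ins for rows of a TF-IDF matrix.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def k_means(vectors, initial_centroids, max_iter=100):
    """Repeat assignment and centroid update until the centroids stop moving."""
    centroids = [list(c) for c in initial_centroids]
    assignment = []
    for _ in range(max_iter):
        # Each document falls into exactly one cluster: the nearest centroid.
        assignment = [
            min(range(len(centroids)), key=lambda k: euclidean(v, centroids[k]))
            for v in vectors
        ]
        # Recompute each centroid from the documents now assigned to it.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [v for v, a in zip(vectors, assignment) if a == k]
            new_centroids.append(mean_vector(members) if members else old)
        if new_centroids == centroids:  # centroids have stabilized
            break
        centroids = new_centroids
    return assignment, centroids

docs = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]  # hypothetical document vectors
labels, centers = k_means(docs, initial_centroids=[docs[0], docs[2]])
print(labels, centers)
```

Each pass costs time proportional to the number of documents for a fixed number of clusters and dimensions, which is the sense in which the algorithm is linear in n.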

G. COSINE SIMILARITY
Cosine Similarity (CS) is a good measure of the similarity between different articles. It recommends the items in which users are most likely interested. Item similarity recommendation depends on the value of CS [28], [30], [36]. For two vectors A and B with n components, CS has the following formula:

$$CS(A, B) = \frac{A \cdot B}{\|A\|\,\|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}$$

CS is the right choice when attributes are high-dimensional, especially in information retrieval and text analysis [37], [38], [42].
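As a minimal sketch, CS can be computed as the dot product of two vectors divided by the product of their lengths. The function name and the example vectors are assumptions made for illustration. Returning 0.0 for a zero-length vector is a convention chosen here, since the measure is undefined in that case.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for empty vectors (assumption)
    return dot / (norm_a * norm_b)

same_direction = cosine_similarity([1, 2, 0], [2, 4, 0])  # parallel vectors
orthogonal = cosine_similarity([1, 0], [0, 1])            # no shared terms
print(same_direction, orthogonal)
```

Because the measure depends only on direction, a long document and a short one with the same term proportions score as highly similar.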

H. ADJACENCY MATRIX
An adjacency matrix represents a graph by a square matrix. The element f (i, j) holds the attributes of the edge (i, j) [20], [39]. A graph whose edges have no attributes can be represented as a matrix of 0s and 1s to save memory, as shown in Figure 1.
We can also define it as follows. Let G be a graph with n vertices ordered from v_1 to v_n. The adjacency matrix of G [39], [46] is the n × n matrix A in which

$$a_{ij} = \begin{cases} 1, & \text{if a path exists from } v_i \text{ to } v_j \\ 0, & \text{if no such path exists} \end{cases}$$
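The definition above can be made concrete with a small sketch that builds the 0/1 matrix from a list of directed edges. The four-vertex graph is hypothetical, and vertices v_1 to v_4 are stored at indices 0 to 3.

```python
def adjacency_matrix(n, edges):
    """Return an n x n matrix with 1 where an edge (i, j) exists and 0 elsewhere."""
    matrix = [[0] * n for _ in range(n)]
    for i, j in edges:
        matrix[i][j] = 1
    return matrix

# Hypothetical directed graph: v_1 -> v_2, v_2 -> v_3, v_3 -> v_1, v_3 -> v_4.
edges = [(0, 1), (1, 2), (2, 0), (2, 3)]
A = adjacency_matrix(4, edges)
for row in A:
    print(row)
```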

IV. METHODOLOGY OF DYNAMIC CORPUS CREATION BY MULTI-STAGE METADATA EXTRACTION
This study presents a Dynamic Corpus Creation (DCC) approach for information retrieval with multi-stage metadata selection. It helps to reduce the cost and time of information extraction. It has multiple steps of corpus processing, which jointly utilize the principles of both single & multiple document summarization. Section-IV-A to Section-IV-J contain the step-by-step description of DCC.

A. PREPROCESSING OF TEXT
Unstructured text has specific abnormalities and unnecessary elements. Removing these redundant elements saves cost and time overhead during text processing. Our dataset for experimentation consisted of over thirty-five hundred long paragraphs on miscellaneous topics [52]. This step resolved the text processing issues of handling UTF & non-ASCII characters. This is step 1 in Figure 2.

B. MAKING MULTIPLE SUMMARIZED PARALLEL CORPUSES
A corpus of long text documents is prone to irrelevant document processing [40], [41]. The proposed approach combined abstractive & extractive techniques. The extractive technique, crucial line extraction, has two methods. In the first method, the sentence score algorithm extracts the high-scoring text lines [39], [48]. In the second method, the word embeddings algorithm extracts those lines with a high frequency of mutually occurring words. We also refer to the word embeddings as the Word2Vec model [31], [37], [44]. We converted the unstructured text corpus into three parallel reduced corpuses, i.e., the Gensim Based (GB) Word2Vec Summarized Parallel Corpus (SPC), the frequency-based FB-SPC, and the Page Rank (PR) keywords. We presented all three corpuses as a tuple in Table 3. Each tuple consists of the actual text, GB reduced text, frequency-based reduced text, and page rank keywords, shown as step 2 in Figure 2.
[Figure caption: Step by step hierarchy of FSC.]
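A frequency-based sentence score extraction of the kind described above can be sketched as follows. Each sentence is scored by the summed corpus frequency of its non-stop words, and the top n sentences are returned in their original order. The stop-word list, the regular-expression sentence splitter, and the sample text are assumptions made for illustration. This is not the Gensim-based method.

```python
import re
from collections import Counter

# Hypothetical, minimal stop-word list (assumption for illustration).
STOP_WORDS = {"a", "an", "and", "is", "of", "the", "to", "in", "was"}

def split_sentences(text):
    """Split on whitespace that follows sentence-ending punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def content_words(sentence):
    """Lowercased alphabetic tokens with stop words removed."""
    return [w for w in re.findall(r"[a-z]+", sentence.lower()) if w not in STOP_WORDS]

def frequency_summary(text, n=1):
    """Return the n highest-scoring sentences, kept in document order."""
    sentences = split_sentences(text)
    freq = Counter(w for s in sentences for w in content_words(s))
    scored = [(sum(freq[w] for w in content_words(s)), i) for i, s in enumerate(sentences)]
    top = sorted(scored, key=lambda t: (-t[0], t[1]))[:n]
    return [sentences[i] for _, i in sorted(top, key=lambda t: t[1])]

text = ("Clustering groups documents. "
        "Clustering of documents reduces the corpus. "
        "The weather was pleasant.")
summary = frequency_summary(text, n=1)
print(summary)
```

Summing raw frequencies favors longer sentences; dividing each score by the sentence length is a common alternative when that bias is unwanted.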

C. FIRST STAGE CLUSTERING
We applied K-Means clustering (KMC) to divide a single large corpus into multiple sub-corpuses. Contrary to the traditional approaches, our approach used the reduced corpus as a substitute for the large corpus [46], [47], [51]. In our experiment, we compared the extended corpuses with the actual text by calculating CS values. We applied KMC and produced the first stage clusters from the GB-SPC. This is step 3 in Figure 2.

D. SUB-CORPUS CREATION
In our approach, the group of documents relating to one cluster is a sub-corpus. A sub-corpus consists of closely related parts of a larger corpus. We compared the KMC clusters with the entire AEC (the actual and extended corpuses) by calculating CS values. We identified the documents relating to a cluster through these CS comparisons. Documents with a high CS towards a specific cluster were associated with one sub-corpus. The distinct clusters represented cluster-level corpus summarization. We have mentioned this summarization as step 4 in Figure 3.
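The grouping rule described above, where each document joins the sub-corpus of the cluster it is most similar to, can be sketched as follows. The document identifiers, vectors, and centroids are hypothetical, and the helper names are assumptions made for illustration.

```python
import math
from collections import defaultdict

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def build_sub_corpuses(doc_vectors, centroids):
    """Map each cluster index to the documents whose CS to that cluster is highest."""
    sub_corpuses = defaultdict(list)
    for doc_id, vec in doc_vectors.items():
        scores = [cosine_similarity(vec, c) for c in centroids]
        best = max(range(len(centroids)), key=scores.__getitem__)
        sub_corpuses[best].append(doc_id)
    return dict(sub_corpuses)

# Hypothetical document vectors and first stage cluster centroids.
doc_vectors = {"d1": [1.0, 0.1], "d2": [0.9, 0.2], "d3": [0.1, 1.0]}
centroids = [[1.0, 0.0], [0.0, 1.0]]
sub_corpuses = build_sub_corpuses(doc_vectors, centroids)
print(sub_corpuses)
```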

E. INTERLINKING MULTIPLE SUB-CORPUSES
We used cluster similarities to join the clusters in a path-oriented manner. The adjacency matrix and adjacency graph provide interlinked processing of the sub-corpuses (SC). By using the adjacency matrix, we related each cluster to all other clusters. This matrix has rows and columns c_1, c_2, ..., c_24, as presented in Table 4. In this path-oriented approach, the processing begins from the most relevant sub-corpus and gradually moves towards the inter-linked sub-corpuses. We have mentioned it as step 5 in Figure 3.
[Figure caption: Step by step processing of SSC.]
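One way to picture the interlinking step is a weighted adjacency matrix whose entries are the similarities between cluster centroids, from which a processing order can be read off. The sketch below assumes cosine similarity as the edge weight and uses three hypothetical centroids instead of the twenty-four clusters of the experiment. The function `traversal_order` is an illustrative assumption about how a path could be selected.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def cluster_adjacency(centroids):
    """Dense weighted adjacency matrix: entry (i, j) is the CS of clusters i and j."""
    n = len(centroids)
    return [
        [0.0 if i == j else cosine_similarity(centroids[i], centroids[j]) for j in range(n)]
        for i in range(n)
    ]

def traversal_order(matrix, start):
    """Begin at the most relevant cluster, then visit the rest from most to least similar."""
    others = [j for j in range(len(matrix)) if j != start]
    return [start] + sorted(others, key=lambda j: -matrix[start][j])

centroids = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]  # hypothetical cluster centroids
matrix = cluster_adjacency(centroids)
order = traversal_order(matrix, start=0)
print(order)
```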

F. DICTIONARY OF KEYWORDS OF SUB-CORPUS
After the sub-corpus creation, we combined the page rank keywords (PRK) of each sub-corpus into a dictionary of keywords. Each document in our experiment contains a specific set of unique keywords, and these keywords form a superset keyword dictionary of a particular sub-corpus. We have processed these sub-corpus dictionaries in the second stage K-Means clustering. The collection of FSC clusters and sub-corpus dictionaries jointly forms enriched metadata to identify the required text. We have mentioned it as step 6 in Figure 3. This metadata forms the first layer of query processing in the proposed approach, as presented in Figure 4.
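The superset dictionary of a sub-corpus amounts to the union of the keyword sets of its documents. A minimal sketch, with hypothetical document identifiers and keywords, is:

```python
def sub_corpus_dictionary(doc_keywords, sub_corpus_docs):
    """Union of the keyword sets of all documents in one sub-corpus."""
    dictionary = set()
    for doc_id in sub_corpus_docs:
        dictionary |= set(doc_keywords[doc_id])
    return dictionary

# Hypothetical page rank keywords per document.
doc_keywords = {
    "d1": ["election", "vote"],
    "d2": ["vote", "senate"],
    "d3": ["rainfall"],
}
dictionary = sub_corpus_dictionary(doc_keywords, ["d1", "d2"])
print(sorted(dictionary))
```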

G. APPLY SECOND STAGE CLUSTERING (SSC) TO SC
In our approach, we applied the second stage KMC over the grouped PRK of the sub-corpuses. It further identifies the relevant portions of an SC and relates them to a specific cluster. KMC produced unique sub-corpus clusters. We calculated CS for each cluster to determine the relevant part of the sub-corpus. We have mentioned it as step 7 in Figure 3.

H. CREATING LINKED PATHS OF SUB-CORPUS
After the second stage K-Means clustering, we inter-related these clusters in the same manner as the first-stage clusters. The adjacency matrix for one of the sub-corpuses has rows and columns c_1, c_2, ..., c_6, as presented in Table 5. All the remaining sub-corpuses have similar matrices.

I. TEXT PROCESSING BY FORWARD PROPAGATION
We created systematic processing layers for hierarchical text processing. The first two layers consist of the clusters of FSC and SSC, joined in a path-oriented manner. The third layer contains the keywords of the documents. The fourth layer has the GB summarized text, as presented in Figure 6. We generated the Query Specific Dynamic Corpus (QSDC) by processing the queries through these layers. First, the query is matched against the set of clusters in the first layer to identify the relevant clusters. Second, the query is matched against the clusters of the sub-corpus. Third, the text query is matched against the keywords of the selected potential paragraphs. Next, the query is matched against the relevant GB summarized text. After the complete processing, the technique extracts the relevant actual text at the last step. We have mentioned it as step 9 in Figure 5. The proposed path-oriented process reduces corpus comparisons and provides three primary processing outcomes, i.e., the QSDC, strongly related corpus segments, and query processing metadata.
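The layered matching described above can be pictured with a small sketch in which a query descends through the first stage cluster, the second stage cluster, the document keywords, and the summarized text before the actual text is returned. The nested dictionary layout, the term-overlap score used in place of the CS comparison, and all sample data are assumptions made for illustration only.

```python
def overlap(query_terms, terms):
    """Number of shared terms; a simple stand-in for the CS comparison."""
    return len(set(query_terms) & set(terms))

def best_key(query_terms, candidates):
    """Return the key whose term list overlaps the query most."""
    return max(candidates, key=lambda k: overlap(query_terms, candidates[k]))

def forward_query(query, corpus):
    """Pass the query through the four layers and collect the matching actual text."""
    q = query.lower().split()
    # Layer 1: pick the most relevant first stage cluster.
    fsc = best_key(q, {k: v["terms"] for k, v in corpus.items()})
    # Layer 2: pick the most relevant second stage cluster of that sub-corpus.
    ssc_map = corpus[fsc]["ssc"]
    ssc = best_key(q, {k: v["terms"] for k, v in ssc_map.items()})
    hits = []
    for doc in ssc_map[ssc]["docs"]:
        # Layer 3: document keywords. Layer 4: summarized text.
        if overlap(q, doc["keywords"]) and overlap(q, doc["summary"].lower().split()):
            hits.append(doc["text"])  # last step: extract the actual text
    return fsc, ssc, hits

# Hypothetical layered corpus.
corpus = {
    "politics": {
        "terms": ["election", "senate", "vote"],
        "ssc": {
            "elections": {
                "terms": ["election", "vote"],
                "docs": [{
                    "keywords": ["election", "turnout"],
                    "summary": "Election turnout rose.",
                    "text": "Full article on election turnout.",
                }],
            },
            "legislation": {
                "terms": ["senate", "bill"],
                "docs": [{
                    "keywords": ["bill"],
                    "summary": "A bill passed.",
                    "text": "Full article on the bill.",
                }],
            },
        },
    },
    "weather": {"terms": ["rain", "storm"], "ssc": {"storms": {"terms": ["storm"], "docs": []}}},
}

result = forward_query("election vote turnout", corpus)
print(result)
```

Only the documents of one second stage cluster are compared against the query, which is how the path-oriented process avoids comparisons with the rest of the corpus.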

J. TEXT PROCESSING BY BACKWARD PROPAGATION
In this step, we performed text processing over the marked set of potential paragraphs. The extracted text relied on the keyword selection, text extraction, and KMC. The current technique generates the text query results either by permanent text extraction for further processing or as temporary, ad-hoc textual results. The query processing proceeds from strongly related to weakly related nodes. Subsequent query processing relies on the metadata by applying the principles of neural networks. More relevant text portions have default priority in text processing. The text queries are kept in the processing history together with their identified related metadata. The queries related to the metadata help to classify similar questions based on clusters & text similarities. We have mentioned it as step 10 in Figure 5.

K. DESCRIPTION OF ALGORITHM
This algorithm improves query processing by executing text processing in stages. At the first stage, the most related major cluster (C_k) and sub-corpus cluster (C_p) generate the reduced corpus. For example, if 20% of the corpus relates to the given query, then corpus processing is reduced by 80%. At the next stage, the query comparison with abstractive summarization identifies more closely related text. For example, if the related text is 13%, then corpus processing is reduced by 87%. At the last stage, the comparison of the query with the extracted summarized text provides the most related actual text documents. This algorithm utilizes the interlinked metadata of the complete corpus and relies on the actual text as well.

V. RESULTS AND DISCUSSION
Our proposed DCC approach processed a News dataset [52]. This dataset has more than thirty-five hundred long paragraphs on miscellaneous topics. In our experiment, we used long paragraphs averaging eight hundred words from this dataset. We applied the Python package NLTK to perform tokenization. Removing unnecessary tokens gave a useful set of tokens. We applied procedures like part-of-speech tagging, stemming & lemmatization, and identified the various types of text tokens.

A. SUMMARIZED PARALLEL CORPUSES
There are two approaches to assess the theme of underlying text documents without processing the whole text, i.e., 1) creating the dictionary of keywords, and 2) extracting the crucial lines of the unstructured text. We generated the reduced corpuses by applying these text reduction techniques as described in Section-IV-B, and this step reduced the large corpus by extracting important words & lines from it, as presented in Table 3. This step makes the text query rely on the reduced actual text & page rank keywords. Figure 7 shows, as a bar chart, the text reduction statistics obtained by the three text extraction techniques. Extractive summarization reduces corpus processing by almost seventy percent: as per our experimentation, GB-SPC & FB-SPC consist of approximately thirty percent, i.e., almost one-third, of the actual text. PRK-SPC gave the top keywords from each paragraph. The PR keywords of a given paragraph were obtained from the pre-processed, cleansed text.

B. SELECTION OF REDUCED CORPUS TO APPLY FSC
We identified similar documents by applying the K-Means clustering (KMC) process. KMC gives good results in many situations [17], [34], [46]. We compared the GB-reduced text with {Actual-Text, SS-reduced text, PRK} by calculating the CS values. Similarly, we compared the SS-reduced text with {Actual-Text, GB-reduced text, PRK}. Figure 9 shows the results of the GB and SS comparisons. The GB-reduced corpus has better text similarity to the actual text. Based on these results, we selected the GB-reduced corpus as a substitute for the actual corpus. We compared the KMC clusters obtained from the GB corpus with each tuple, as illustrated in Table 3. Figure 4 represents the corpus view after applying the first stage K-Means clustering. Table 3 represents the formation of the underlying dataset.
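The selection step can be sketched as a comparison of mean cosine similarity between each reduced corpus and the actual text. This is a minimal sketch assuming plain word-count vectors in place of the paper's text representations; the function names, the labels `reduced_A` and `reduced_B`, and the sample documents are hypothetical and do not reproduce the reported results.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity of two texts represented as word-count vectors."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def pick_substitute(actual_docs, candidates):
    """Return the candidate corpus that is, on average, closest to the actual documents."""
    def mean_cs(reduced_docs):
        pairs = zip(actual_docs, reduced_docs)
        return sum(cosine_similarity(a, r) for a, r in pairs) / len(actual_docs)

    means = {name: mean_cs(docs) for name, docs in candidates.items()}
    return max(means, key=means.get), means


# Made-up documents; each candidate holds one reduced text per actual document.
actual = [
    "the council approved the city budget today",
    "the team won the final match",
]
candidates = {
    "reduced_A": ["council approved city budget", "team won final match"],
    "reduced_B": ["council today", "the match"],
}
best, means = pick_substitute(actual, candidates)
print(best, means)
```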

C. CREATING SUB-CORPUSES BY USING KMC
We applied FSC to the GB-reduced corpus to generate the set of KMC clusters. We matched each cluster with the actual and extended corpuses. The sub-corpus creation followed the steps mentioned in Section-IV-D. Figure 8 represents the distinct first stage clusters. We used twenty-four distinct clusters to identify the related groups of documents. We calculated the CS values between these clusters and placed them in the next AEC columns, as presented in Table 3. Based on text similarity, we grouped each document under a designated sub-corpus. This step created a cluster-based grouping of a large corpus into manageable sub-corpuses. Figure 10 presents the sub-corpus view, in which each cluster has a specific set of documents. These sub-corpuses form subsets of similar documents within one large corpus.
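The grouping of documents into sub-corpuses can be sketched with a small K-Means routine. This is an illustrative sketch only: it uses plain Lloyd's iterations with a deterministic start, Euclidean distance, two clusters, and hypothetical two-dimensional vectors standing in for the document embeddings, whereas the experiment above used twenty-four clusters.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain Lloyd's k-means with a deterministic start (first k points as centroids)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def build_sub_corpuses(doc_ids, labels):
    """Group document ids by cluster label; each group is one sub-corpus."""
    sub_corpuses = {}
    for doc_id, label in zip(doc_ids, labels):
        sub_corpuses.setdefault(label, []).append(doc_id)
    return sub_corpuses


# Hypothetical 2-D document vectors (stand-ins for real embeddings).
doc_ids = ["d1", "d2", "d3", "d4"]
vectors = [(0.9, 0.1), (0.1, 0.9), (0.8, 0.2), (0.2, 0.8)]
labels, centroids = kmeans(vectors, k=2)
sub_corpuses = build_sub_corpuses(doc_ids, labels)
print(sub_corpuses)
```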

D. CONNECTING SUB-CORPUSES BY DENSE GRAPH
We interrelated the clusters in a path-oriented manner by following the steps mentioned in Section-IV-E. We developed a dense adjacency matrix, as presented in Table 4. Figure 11 represents this matrix as an adjacency graph that shows the sub-corpus nodes connected by weighted paths. This graph provides path selection during corpus processing. Each row in Table 4 depicts a cluster's relation to another cluster, and vice versa. The AM of each SC presents a complete mapping of clusters. We generated adjacency lists by processing the AM and AG. These lists relate every sub-corpus to its related set of sub-corpuses.
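Deriving adjacency lists from a weighted adjacency matrix can be sketched as below. The sub-corpus names and the weights are hypothetical and are not taken from Table 4; sorting each list by weight is an assumption made here so that the strongest relation comes first.

```python
def adjacency_lists(matrix, names):
    """Turn a weighted adjacency matrix into per-node neighbour lists, strongest link first."""
    lists = {}
    for i, row in enumerate(matrix):
        # Skip the diagonal (self-similarity) and any zero-weight entries.
        neighbours = [(names[j], w) for j, w in enumerate(row) if j != i and w > 0]
        lists[names[i]] = sorted(neighbours, key=lambda pair: pair[1], reverse=True)
    return lists


# Hypothetical symmetric similarity weights between three sub-corpuses.
names = ["SC1", "SC2", "SC3"]
matrix = [
    [1.00, 0.62, 0.15],
    [0.62, 1.00, 0.40],
    [0.15, 0.40, 1.00],
]
lists = adjacency_lists(matrix, names)
for name, neighbours in lists.items():
    print(name, neighbours)
```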

E. SUB-CORPUS KEYWORDS DICTIONARY, SSC, AND INTERLINKING SUB-CORPUSES
We created the keyword dictionaries of each sub-corpus, as discussed in Section-IV-F. We applied the second stage KMC over these dictionaries to perform further sub-grouping of the documents. In this step, the unique words of multiple documents jointly form the collective dictionary of each sub-corpus. We refer to this as multi-document abstractive summarization. It provides the collective text relevance specific to a group of documents [6], [14], [29]. For each cluster, we obtained CS values for a tuple of actual and extended corpuses. We linked the clusters of sub-corpuses by a sparse adjacency matrix. Table 5 presents the adjacency matrix of one of the sub-corpuses. In Table 5, each row shows the relation of a cluster to another and vice versa. The AM of each sub-corpus presents a complete mapping of clusters. We have marked it as step 8 in Figure 5. By utilizing this matrix, path-oriented information processing follows weighted edges between connected nodes, as shown in Figure 12. This graph shows the relationship, in the form of weights, between the inter-related clusters of the sub-corpus. Figure 21 presents the corpus view after complete processing.
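The collective dictionary of a sub-corpus can be sketched as the union of its documents' keywords. The function name and the keyword lists below are hypothetical; in the described approach the per-document keywords would come from the page rank keyword step.

```python
def sub_corpus_dictionary(doc_keywords):
    """Sorted union of the keywords of every document in one sub-corpus."""
    return sorted(set().union(*doc_keywords.values()))


# Hypothetical per-document keywords for one sub-corpus.
doc_keywords = {
    "d1": ["election", "vote", "ballot"],
    "d2": ["vote", "parliament"],
    "d3": ["election", "parliament", "coalition"],
}
dictionary = sub_corpus_dictionary(doc_keywords)
print(dictionary)
```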

F. EVALUATION AND COMPARISONS OF IMPLEMENTED TECHNIQUE
Our approach has many advantages over traditional metadata extraction and processing techniques. We carried out different text processing comparisons to assess the proposed method's efficiency by executing text queries. We processed these text queries with six other techniques; this analysis was performed in the same way as discussed in Section-IV-I and IV-J. This section presents the results of comparing the different techniques. In our experimentation, path-oriented text processing gave better results than other commonly used text processing techniques.
Abstractive Summarization: Every webpage has specific keywords to get the attention of audiences. In our experimentation, this technique provided less than 7% relevant text while processing 100% of the text documents, as presented in Figure 14. This technique relies on human analysis and sometimes gives misleading results because of human errors, yet it is widely implemented as the simplest form of metadata. TF-IDF-based keyword selection gives text-based keywords. These keywords relate better to the text than those from simple keyword selection techniques. These two forms of abstractive summarization are similar: both follow the dictionary approach, and both depend on selecting important terms as keywords. In both techniques, selecting misleading words wastes time and resources. In our experiment, TF-IDF-based PRK utilized 100% of the text documents. The text similarity remained about 7%, as presented in Figure 15. However, it provided better results than the simple keyword selection process, as presented in Figure 20.
Clusters based Text Summarization: In our experiment, First Stage and Second Stage Clustering generated cluster-level corpus summarization. These clusters help determine the corresponding portions of a corpus. We carefully matched the clusters with all forms of existing and extended corpuses and selected those clusters that gave adequate similarity with TFIDF-PRK, word2vec-reduced-text, and sentence-score-text. This technique supports topic assessment in text analysis studies [4], [17], [21], [25]. These clusters serve as nodes in our adjacency graph, and various clusters form a weighted path for corpus traversal and processing. This technique generates efficient path-oriented metadata. The first stage clusters gave about 10% text similarity and processed approximately 50% of the documents, as presented in Figure 16. In the second stage, K-Means clustering over the sub-corpus dictionary of words formed many interconnected nodes of a text corpus. These nodes provide intensive corpus traversal in which dense graphs serve as a central entity. Our experimentation shows that cluster-based text query processing facilitated better text analysis, with more than 10% text similarity and better observed result satisfaction. This technique processed less than 40% of the text documents, as presented in Figure 17. It gives better text processing than simple abstractive text summarization, as presented in Figure 20.
Abstractive & Cluster-based Summarization: We joined query processing with cluster and keyword selection. In this way, one cluster becomes connected with the sub-corpus keyword dictionary and the individual documents' keywords. This form of text processing is an improved form of abstractive text summarization based on a better set of related text documents. It gives better text similarity than relying on abstractive text summarization alone: about 10% text similarity while processing approximately 40% of the actual text documents, as presented in Figure 18. However, multistage clustering has better text processing results than the combined clustering and dictionary-based abstractive summarization mechanism, as presented in Figure 20.
Path-oriented approach with FSC & SSC to select the sub-corpuses: Processing the sub-corpus clusters together with metadata and summarized text extracts the most related text documents. This technique provides path-based metadata processing and utilizes the advantages of all the text extraction and summarization techniques. It offers an efficient starting point for corpus processing, so that text processing resources become systematically pre-prioritized. The large, unmanageable corpus becomes small and less diverse, and a query-based dynamic corpus becomes available. Each query has its own set of clusters, dictionary of keywords, and extracted text. This reduced form of metadata utilized the actual extracted text and facilitated retrieval of the most query-relevant real documents, as presented in Figure 19. This technique uses about 33% of the text documents and provides approximately 20% text similarity in actual documents. It processes 66% fewer documents and gives four times better results. Compared to simple cluster summarization, it uses half as many text documents and gives two times better results. Compared to all the other techniques, it provides much better text satisfaction results, as presented in Figure 20. Further, it facilitates corpus enhancement by placing additional text in the most related sub-corpus.
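One way to picture path-oriented selection is a best-first traversal over the weighted sub-corpus graph, starting from the sub-corpus most similar to the query. This is a sketch under assumptions: the traversal rule (always follow the strongest unexplored edge), the `min_weight` cut-off, and every weight and similarity value below are hypothetical, since the section does not specify them.

```python
import heapq


def traversal_order(graph, start, min_weight=0.0):
    """Visit sub-corpuses from `start`, always following the strongest unexplored edge."""
    order, seen = [start], {start}
    # Max-heap via negated weights: entries are (negative weight, neighbour).
    frontier = [(-w, n) for n, w in graph[start].items() if w >= min_weight]
    heapq.heapify(frontier)
    while frontier:
        _, node = heapq.heappop(frontier)
        if node in seen:
            continue
        seen.add(node)
        order.append(node)
        for neighbour, w in graph[node].items():
            if neighbour not in seen and w >= min_weight:
                heapq.heappush(frontier, (-w, neighbour))
    return order


# Hypothetical similarity weights between sub-corpuses.
graph = {
    "SC1": {"SC2": 0.62, "SC3": 0.15, "SC4": 0.05},
    "SC2": {"SC1": 0.62, "SC3": 0.40, "SC4": 0.10},
    "SC3": {"SC1": 0.15, "SC2": 0.40, "SC4": 0.30},
    "SC4": {"SC1": 0.05, "SC2": 0.10, "SC3": 0.30},
}
# Hypothetical similarity of one query to each sub-corpus.
query_similarity = {"SC1": 0.12, "SC2": 0.55, "SC3": 0.20, "SC4": 0.01}

start = max(query_similarity, key=query_similarity.get)
order = traversal_order(graph, start, min_weight=0.2)
print(start, order)
```

Raising `min_weight` prunes weakly related sub-corpuses from the path, which trades coverage for fewer processed documents.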

G. SHAP EXPLANATION
The purpose of SHAP is to explain the prediction for a given instance. We carried out this task by computing and assessing the contributions of the proposed model's basic features to the prediction. The SHAP method performs the explanation by computing estimation values. These estimation values are Shapley values, a concept from coalitional game theory. The method treats the feature values of an instance as players in a coalition, and the Shapley values fairly distribute the payout (the prediction) among the features. In this model, a player is either an individual feature value or a group of feature values [11], [14], [31], [43]. Figure 22 shows the contribution of the various text processing techniques. Among these, the simple keyword-based approach contributes less to assessing the relevant documents. The clustering approach improved the proposed model. Multistage clustering, based on the summarized text, has a key role in identifying the related text documents.
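The fair distribution of a payout among players can be illustrated with an exact Shapley value computation for a small coalitional game. This sketch applies the textbook definition directly and is not the SHAP library or the evaluation behind Figure 22; the three player names and all payout values are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values for a coalitional game.

    players : list of player names (here: features or groups of features)
    value   : function mapping a frozenset of players to that coalition's payout
    """
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for size in range(n):
            for subset in combinations(others, size):
                s = frozenset(subset)
                # Weight of this coalition in the average over all join orders.
                weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi


# Hypothetical payouts for every coalition of three feature groups.
payout = {
    frozenset(): 0,
    frozenset({"keywords"}): 10,
    frozenset({"clustering"}): 20,
    frozenset({"summary"}): 30,
    frozenset({"keywords", "clustering"}): 30,
    frozenset({"keywords", "summary"}): 40,
    frozenset({"clustering", "summary"}): 60,
    frozenset({"keywords", "clustering", "summary"}): 70,
}
players = ["keywords", "clustering", "summary"]
phi = shapley_values(players, payout.__getitem__)
print(phi)
```

The values sum to the payout of the full coalition, which is the sense in which the prediction is distributed among the features.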

VI. SUMMARY
Section-I formally defines the contents of our paper. It provides an understanding of document summarization, text clustering, and cluster-level summaries, which makes the technical work in the other sections simple to follow. It shows how we string together various standard information retrieval techniques and machine learning approaches and states the reasoning behind them. Our news dataset is diverse in nature and covers various topics, which justifies clustering as a reasonable idea. The experimentation presented in Section-V confirms that this technique reduces the computational cost. It handles information diversity by following the principles of text summarization and path-oriented text processing.
Section-II contains various corpus processing applications, and it covers the main issues and findings of these related studies. These studies show that efficient corpus processing helps decision making for multiple types of report analysis [1]. Certain aspects of text, such as critical phrases, frequent words, and sentence impact, affect text query results. Data structures like graphs and trees help represent these aspects as linked information paths. These interlinked paths make it possible to manage corpus diversity and provide convenient corpus traversal during information processing and extraction [2], [3]. The traversed edges and nodes can use summarized text with adequate accuracy. Text summarization techniques extract the crucial part of the given text, which greatly helps assess the actual theme of unstructured text. It also helps to judge the quality and effectiveness of text query results at reduced cost. There exist various forms of text summarization, and their selection depends on the processing requirements [4]. Automatic text assessment and corpus processing give better results with reduced cost and time [4], [5].
Section-II thus shows the relevance of the current technique to other studies on corpus processing.
Section-III contains a brief description of the techniques that we applied in the current study to generate metadata and extract text. Every text analysis starts with text pre-processing and mostly includes dictionary analysis. This step remains intact in many applications of natural language processing [4], [7], [11]. Assessing the existence of the desired text in unstructured text documents requires a multi-document review. Webpages and informal documents generally have keywords or page rank keywords to represent the actual text document's theme. TF-IDF and PRK generally help to assess the contents of relevant text documents [7], [16], [26]. However, for in-depth text assessment, multiple text processing techniques must be systematically joined to relate text segments [9], [14]. This section briefly describes word embeddings, cosine similarity, K-Means clustering, and the adjacency matrix. Our corpus processing technique joins these techniques hierarchically to perform information retrieval by path-oriented corpus subgrouping.
Section-IV presents a way to convert the miscellaneous mesh of textual documents into a metadata-based, interlinked, path-oriented corpus. It has two fundamental concerns: first, generating the path-oriented mechanism, and second, processing text queries. At first, we represented the large corpus by three different reduced corpuses obtained through text extraction. Through experimentation, we selected the Word2Vec-reduced text to represent the actual corpus and to perform KMC. We divided the actual corpus into multiple sub-corpuses by applying first stage KMC. Then we interconnected these clusters through a dense AM and AG as a path-oriented corpus representation, and we further applied KMC over a dictionary of sub-corpus keywords. We interconnected the obtained clusters with a sparse AM and AG.
This procedure provides weighted connected paths during text processing. At the second stage, query processing starts by comparing the query with the first-stage clusters to identify the relevant clusters. For this, the dense adjacency graph with weighted paths facilitates the selection of the appropriate sub-corpuses. Processing of the related sub-corpus then utilizes the sparse adjacency matrix. Finally, the related actual documents are identified as the processing outcome. Table 2 presents this process in the form of an algorithm. We implemented Algorithm 1 in Python (NLTK and Gensim packages). Table 3 shows the sample results of this algorithm.
Section-V contains the implementation results of our technique. It shows that joining the metadata with reduced pre-extracted text gives better text processing results. Using this form of metadata, our DCC converted a broad set of documents into small, manageable subgroups. We presented these subgroups/sub-corpuses as nodes joined by weighted paths in the AG. These weighted paths contain the CS values that express the extent of the relation between any two subgroups of documents. They provide a traversal approach for the corresponding portions of the corpus. This form of interlinked processing dynamically identifies the parts of the corpus, from strongly related to less related. This technique works efficiently for text queries by systematically applying the existing text processing approaches. This section presented the results of our technique in tables and graphs. The results for path-oriented corpus processing proved better than those of approaches that do not apply path-oriented text processing.

VII. CONCLUSION
The current study's contribution is a technique to overcome the diversity of unstructured text with reduced cost, as the proposed technique provides a language-structure independent mechanism. It utilizes several techniques to identify the related parts of the corpus. One of our contributions is an improved corpus visualization instead of dividing the corpus into disjoint sets. We proposed an interrelated-nodes approach that combines homogeneous parts of text documents based on word and sentence similarity estimations. Our other contribution is the constructive implementation of the K-Means clustering approach to reduce redundancy in information processing. Our experimental results show that the proposed technique gives improved performance. The page rank keywords, sentence score algorithm, and Word2Vec model jointly give better similarity estimation. This combination effectively produces more coherent and informative query results and reflects the salient documents.
This technique is practically advantageous and supports all the dictionary-based text processing algorithms. The suggested text processing mechanism can highlight the more and less critical parts of the given text during query processing. Search engines can utilize this approach for any form of unstructured text document. The proposed work targets the homogeneous parts of the corpus and identifies the related text for better text query processing. It follows a practical approach by focusing on documents, more related documents, and the most related contents.
Although our technique offers several improvements, it also has limitations. First, it ignores text-summary and sentence positions, which may reduce the reading ease of the generated summaries. Moreover, we have not considered structural similarity or features of language stylometry in our text similarity estimation. Text engineering and data science research also has inherent limitations arising from text bias, polarity, incomplete text, and spam. These issues therefore deserve more attention in further related research.
The presented technique suggests various future implementations that generate corpus visualization with nodes and weighted paths to perform knowledge mining, multiple document processing, and other content analysis. Research on text summarization has critical applications such as information coverage with reduced text, query processing with redundancy reduction, topic assessment, and improved readability through short text. For all these areas of interest, text summarization serves as a multi-objective problem-solving approach. This is our perspective on future studies in the field of text mining.
MUHAMMAD AWAIS is currently working as an Assistant Professor and an Incharge of the Department of Software Engineering, Government College University Faisalabad (GCUF).
RAMZAN TALIB is currently working as a Professor and the Chairman of the Department of Computer Science, Government College University Faisalabad (GCUF).
MUHAMMAD YOUNAS (Member, IEEE) received the Ph.D. degree from the Faculty of Engineering, School of Computing, Universiti Teknologi Malaysia (UTM). He is currently working as an Assistant Professor with the Department of Computer Science, Government College University Faisalabad, Pakistan. His research interests include software engineering, agile software development, cloud computing, and code clone detection.