
Reducing Data Volume in News Topic Classification: Deep Learning Framework and Dataset



Abstract:

With the rise of smart devices and technological advancements, accessing vast amounts of information has become easier than ever before. However, sorting and categorising such an overwhelming volume of content has become increasingly challenging. This article introduces a new framework for classifying news articles based on a Bidirectional LSTM (BiLSTM) network and an attention mechanism. The article also presents a new dataset of 60,000 news articles from various global sources. Furthermore, it proposes a methodology for reducing data volume by extracting key sentences with an algorithm, resulting in inference times that are, on average, 50% shorter than for the original document without compromising the system's accuracy. Experimental evaluations demonstrate that our framework outperforms existing methodologies in terms of accuracy. Our system's accuracy has been compared with various works using two popular datasets, BBC News and AG News, achieving excellent results of 99.7% and 94.55%, respectively.
Published in: IEEE Open Journal of the Computer Society ( Volume: 6)
Page(s): 153 - 164
Date of Publication: 18 December 2024
Electronic ISSN: 2644-1268

CC BY 4.0 - IEEE is not the copyright holder of this material. Licence terms: https://creativecommons.org/licenses/by/4.0/
SECTION I.

Introduction

Over time, technological advancements combined with the evolution of the Internet have made it possible to reach an ever-increasing number of individuals. Furthermore, the widespread use of smart devices has enabled users to connect to the network anywhere at any time. This scenario gives everyone the possibility to readily access information published on the network and enlarge their cultural horizons. However, this continuous flow of information, in the form of posts, blogs, news, and text documents in general, has produced a chaotic and enormous offering that must be sorted to support particular services. For instance, using algorithmic methodologies, a social network aims to categorise textual posts, or an email service provider organises the email view according to its users' needs [1]. Thus, defining class-oriented services that summarise the content of each document is an open challenge. In recent years, the need to sort online documents has therefore driven an explosion of research on text classification systems, motivated by the urgency of finding new methodologies that avoid time-consuming and costly processes, e.g., manual procedures. Text classification solutions automatically assign categories to textual documents. These solutions are used in various applications, such as question-answering systems, sentiment analysis, user recommendations, text filtering and summarisation, fake news detection, and topic or category prediction [2], [3]. To better understand the importance of topic classification, consider a hypothetical use case in which an application detects fake news and suggests the real counterpart, so that users can read the real news. To fulfil this scenario, the application must not only determine whether news articles are real or fake but also classify the topic and find other news on the same topic. Another application of the proposed algorithm relates to conversational agents (CAs): any CA has to understand the discussed topic in order to provide useful insight back to the user, which is only possible if the CA is able to infer and connect different topics discussed at different times.

In all these applications, the input is a text document, and the output is a set of labels specific to the classification problem, such as sports and politics for topic classification or fake and real for fake news detection. However, several difficulties emerge in addressing this issue, such as the wide variation in text length, from typically short social media posts to long clinical documents, with medium-length texts in between, e.g., an email to be labelled as spam or a news article on the web whose topic must be predicted. Moreover, news topic extraction techniques in the literature face issues such as data sparsity and memory overload due to the massive dimensionality of natural language, hence the need to convert text into numerical form for processing with artificial intelligence. Currently, the proposed algorithms are compute-intensive and often unfeasible for low- to middle-performance computers. Additionally, the limited availability and diversity of datasets is a persistent issue: existing datasets often cover only a few topics and are limited to a few news sources, leading to generalisation issues. The article provides the following contributions:

  1. We propose a topic classification framework for news articles, combining well-known techniques, such as Bidirectional LSTM (BiLSTM) and attention mechanism. Experiment results, evaluated on a wide variety of classification datasets, demonstrate our system outperforms the other existing methodologies in the literature in terms of accuracy.

  2. We introduce a methodology to decrease the data volume by using an algorithm to extract the key sentence of a document in order to enhance the efficiency without reducing the overall accuracy.

  3. We compare our system with six other models using two datasets commonly used in the literature. Our results show excellent performance, demonstrating the effectiveness of our approach.

  4. Finally, we propose a new dataset, named Global News 60K (GN60K), composed of 60,000 news articles from different sources in different parts of the world, covering 10 topics, in order to provide a rich dictionary, avoid overfitting problems, and create a better-generalised framework. The dataset is made available to the research community for testing news topic classification algorithms.

The rest of the article is organised as follows: in Section II, we briefly review the existing approaches and techniques for topic classification. In Section III, we define the problem and its scope and propose a novel solution to address the automatic topic classification problem. Section IV describes the dataset proposed, while the model evaluation is described in Section V. Finally, the simulation results are presented in Section VI, and the article concludes with final remarks in Section VII.

SECTION II.

Related Works

This section provides a brief overview of the background of the topic classification process and then discusses the techniques usually employed. Among all text classification mechanisms, this section focuses on the approaches, works, and techniques concerning the classification of news topics.

Topic classification is a procedure that takes an input text and returns the most likely category summarising the text in one word or a pair of words. The procedure goes by several names, e.g., topic analysis, topic extraction, or automatic labelling. However, all these procedures typically involve two main steps: the extraction and selection of features, and the text classification. These steps are influenced by several parameters, such as text length (classified as short, medium, and long) [4], the writing style, the content type, and the language used. Indeed, short texts tend to carry a limited amount of information compared to medium and long texts, with short texts comprising fewer than 40 words, medium-length texts typically ranging from 40 to 100 words, and long texts exceeding 100 words.

A. Feature Extraction and Selection

This step involves selecting the essential features best suited to the specific classification task and transforming them into a numerical representation. A vector representation is necessary since most Machine Learning (ML) models can only work with vectors and numerical formats [5]. This step ranges from simple techniques, such as Bag of Words (BOW) and term frequency-inverse document frequency (TF-IDF), to advanced techniques like word embeddings.

BOW is a simple method that represents text data as a list of unique words without considering their meanings. This list is then converted into a numerical format using, for instance, one-hot encoding, which assigns a unique number to each word in the vocabulary. However, one-hot encoding can be inadequate for text classification as it requires high computational resources, especially when the vocabulary is large [6]. Another technique, TF-IDF, weights words based on their frequency within a document and how rarely they appear across the collection, allowing for a better representation of the text data and improving results compared to the previous method. However, although TF-IDF provides better results, it still requires considerable computational effort due to the sparse matrix it creates [7]. In addition, another common technique, N-grams, is often used for text classification and language modelling tasks. By considering the sequence of words in a text, N-grams capture its local context, which can provide important information for text classification and language modelling. However, the length of the N-grams directly impacts the required computational power, which can become an issue [8]. In recent years, the literature has been moving towards word embedding techniques, advanced algorithms that consider both syntactic and semantic aspects of words and form the cornerstone of Natural Language Processing (NLP) models [9]. These techniques are essential for reducing the number of features in NLP models because they allow for the identification of semantic similarity among words. By representing words as vectors with similar numerical properties, words that are semantically similar can be treated as interchangeable features. This results in a more efficient representation of the data, which is useful when dealing with large text datasets.
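As an illustration, the BOW, TF-IDF, and N-gram representations described above can be produced in a few lines with scikit-learn; the following is a minimal sketch on toy documents, not part of the evaluated framework.

# Minimal sketch of the feature-extraction techniques discussed above,
# using scikit-learn on toy documents (illustrative only).
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "The central bank raised interest rates again this quarter.",
    "The striker scored twice as the home team won the match.",
]

# Bag of Words: raw counts over the vocabulary, ignoring word order and semantics.
bow = CountVectorizer()
X_bow = bow.fit_transform(docs)              # sparse matrix, shape (2, |vocabulary|)

# TF-IDF: down-weights terms that appear in many documents;
# ngram_range=(1, 2) also includes bigrams (an N-gram representation).
tfidf = TfidfVectorizer(ngram_range=(1, 2))
X_tfidf = tfidf.fit_transform(docs)

print(X_bow.shape, X_tfidf.shape)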

B. Text Classification

Text classification refers to a set of techniques used to group data into categories or classes based on certain characteristics. These techniques can be categorised into three main approaches, i.e., unsupervised, supervised, and semi-supervised. The main difference lies in the presence (supervised) or absence (unsupervised) of a set of labels used to train the algorithms that classify data or predict outcomes. Semi-supervised approaches combine both methodologies, leveraging a small labelled dataset to guide the learning process while still exploiting a larger unlabelled dataset to extract features and improve the overall accuracy of the model.

Among the unsupervised approaches, the literature proposes systems based on the well-known technique of Latent Dirichlet Allocation (LDA), which extracts subjects from a document and provides explanations for the similarities among individual parts of the documents.

Most of the literature focuses on supervised approaches because they are easier to validate, since the output can be compared with the initial labels. Among the many approaches in the literature, the most popular are Support Vector Machine (SVM), Naive Bayes (NB), Random Forest (RF), and K-Nearest Neighbours (KNN). These algorithms are widely used due to their effectiveness in various text classification applications. They work by learning patterns and relationships in the training data and then making predictions on new data based on that learning. RF is effective at handling large datasets and has good generalisation ability, but it may be computationally expensive and may not perform well with highly correlated or imbalanced data [10]. On the other hand, KNN is an instance-based learning algorithm that works well with nonlinear and complex data but may be computationally expensive and may not perform well with large datasets or noisy data. Among the supervised approaches, various Deep Learning (DL) models have proven effective in solving tasks involving large amounts of textual data. These models learn high-level data representations and make predictions based on them. One of the critical advantages of DL models is their ability to handle large amounts of text data, a common challenge in text classification. For instance, among the most used models for text classification are the Recurrent Neural Network (RNN) [11] and Long Short-Term Memory (LSTM) [12], which are designed to handle sequential data, making them well suited to text. In addition, many works employ Convolutional Neural Networks (CNNs) [13], which apply filters to the input data to extract local features and relationships between adjacent words. The literature continues to explore various combinations of DL models, obtaining progressively more accurate systems.
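For concreteness, a classical supervised pipeline of the kind discussed above (TF-IDF features feeding a linear SVM) can be sketched with scikit-learn; the data and pipeline below are illustrative only and do not reproduce any specific cited work.

# Classical supervised baseline: TF-IDF features + linear SVM (illustrative toy data).
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

train_texts = ["stocks fell sharply", "the team won the cup", "parliament passed a bill"]
train_labels = ["business", "sports", "politics"]

clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(train_texts, train_labels)
print(clf.predict(["the match ended in a draw"]))   # e.g., ['sports']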

The task of text classification can be tackled using different approaches, but the supervised approach is commonly regarded as the most effective and easy to validate. A wide range of supervised techniques, including ML and DL algorithms, is capable of identifying patterns and relationships within the training data, which are then used to make predictions on new data. In contrast, unsupervised and semi-supervised approaches present several drawbacks, such as difficulty validating the system and reliance on the quality of the extracted features. Therefore, the supervised approach is commonly considered the preferred choice in the context of topic classification.

C. State of the Art on Topic Classification

In this section, we provide an overview of the works available in the literature, showing how they combine feature selection and classification techniques. Among unsupervised works, Zheng et al. [14] employed the LDA technique to generate new topic labels from news articles and to find the optimal number of extracted topics. In another work employing the BBC News dataset, the system was evaluated with various classifiers, including NB, RF, KNN, and SVM, with SVM showing the best results. Similarly, Abhishek et al. [15] built a system to compare BERT and RoBERTa on the same BBC News dataset. In contrast, Alam et al. [11] took a different approach by utilising DL algorithms, specifically evaluating the performance of CNNs, Artificial Neural Networks (ANNs), and BiLSTM. The latter is a type of recurrent neural network that takes into account both past and future input when making predictions. The authors also considered hybrid approaches for categorising a set of textual Bangla news articles, selecting features with an embedding layer that converts each word into a vector form before feeding it into the DL models. They concluded that performance was significantly better when using hybrid approaches, such as the combination of CNNs and BiLSTMs.

Significant work has focused on embedding algorithms. First, Shah et al. [16] utilised BBC News to classify news topics, exploiting BERT for feature extraction and classification. Second, Kavitha et al. [17] employed another embedding algorithm called FastText, an open-source tool that enables the creation of an unsupervised or supervised learning algorithm to obtain vector representations of words; they then fed the output vectors to a multichannel CNN to predict the category. A different classification was performed by Yogatama et al. [18], who compared generative and discriminative models based on BiLSTM and found that the generative models outperform the discriminative ones in topic classification tasks. To validate their system, they employed the widely used AG News dataset, which has also been utilised in other works, such as Kumar Velu et al. [19]. In particular, the latter combined a CNN and an LSTM, focusing on short text and using a New Caledonian crow optimisation (NC^{2}LO). A more recent architecture, known as the Transformer, employs an Attention Mechanism (AM), making this model highly effective for text classification tasks. Among these models, BERT, RoBERTa, and ALBERT have demonstrated exceptional performance, as evidenced by the work of Shi et al. [20]. In their study, the authors employed a combination of RoBERTa and a BiLSTM network to classify topics in two datasets, namely THUCNews and Shopping Review. Another work, conducted by Mandal et al. [21], employed Transformer-based architectures, specifically ALBERT and RoBERTa, to classify short text data, i.e., tweets, into various topics using a custom dataset. Transformer models excel in topic classification thanks to their attention mechanisms, which enable them to capture complex relationships between words and gain semantic understanding. However, the literature on topic classification also includes generative Transformer-based models, such as the Generative Pre-trained Transformer (GPT). Although these models are primarily designed to generate human-like text, they can be fine-tuned for classification tasks. In a recent study, Saeed et al. [22] demonstrated how GPT-2 outperformed all the ML models employed in their work, e.g., LSTMs and CNNs. Other examples of generative Transformer-based models are GPT-3.5, GPT-4, and Llama, also known as Large Language Models (LLMs) since they comprise billions or even trillions of parameters. Finally, Milios et al. [23] achieved excellent results using the LLM Llama for classification without fine-tuning the model, even surpassing fine-tuned ones.

In summary, previous studies have investigated the use of DL models for topic classification tasks, accomplishing remarkable results, albeit with some limitations. Some studies have only considered the most important keywords of the input data, neglecting valuable contextual information that could be crucial for accurate classification. Another limitation is that some models may not perform well on short texts, which are becoming increasingly common in modern applications. To overcome these limitations, we propose a novel approach that incorporates a pre-processing step based on a Key Sentence Extractor (KSE) algorithm, reducing the input document's volume without sacrificing important information. This allows our framework to handle medium- and long-length texts as well as short texts, improving the overall performance of the classification system. Furthermore, we use a BiLSTM followed by an attention layer to produce the final classification. Our approach has shown promising results, outperforming previous state-of-the-art models on several benchmark datasets. To provide a comprehensive overview of the related works, we have summarised the previous studies in Table 1.

TABLE 1. Topic Classification Works - Summary of Approaches, Features, and Datasets
SECTION III.

Proposed Solution

In this section, we first mathematically define the problem and then provide an overview of the process used to obtain the results presented in Section VI. The process is split into three distinct phases: Key Sentence Extractor (KSE), word embedding, and, finally, the topic classifier.

A. Problem Definition

The automatic classification branch encompasses various forms of data, including video, audio [25], images [26], and text. Among them, this article focuses on classifying textual data, specifically news articles on the web. In our modelling, each document d_{i} is decomposed into sentences, defined as a set {\mathcal {S}}_{i} = \lbrace s_{i1},s_{i2},\ldots,s_{iy},\ldots,s_{iM_{i}}\rbrace, where M_{i} represents the number of sentences in the i-th document. Each sentence s_{iy} consists of a set {\mathcal {W}}_{iy} = \lbrace w_{iy1},\ldots, w_{iyz},\ldots, w_{iyZ_{iy}}\rbrace of Z_{iy} words, so that the words belonging to the document can be represented by the set {\mathcal {W}}_{i} = \bigcup _{y=1}^{M_{i}} {\mathcal {W}}_{iy}. Additionally, we define the set of keywords in the sentence s_{iy} as {\mathcal {K}}_{iy} = \lbrace k_{iy1},\ldots, k_{iyq},\ldots, k_{iyQ_{iy}}\rbrace, where, for each document d_{i} and sentence s_{iy}, the total number of keywords Q_{iy} is at most equal to the number of words Z_{iy}, i.e., the condition Q_{iy}\leq Z_{iy} is always satisfied. This condition ensures that the number of selected keywords never exceeds the number of words in the sentence, while allowing for the possibility that all words in a sentence are considered keywords.

The goal of the article is to address two problems: to assign a topic to an unlabelled text corpus and to remove irrelevant information from the text. Regarding the first problem, the potential range of topics for news articles is unlimited; our supervised approach considers a limited set of J topics {\mathcal {Y}} = \lbrace y_{1}, y_{2},\ldots, y_{j},\ldots, y_{J}\rbrace, with each document being assigned to a single topic.

Problem 1 (Topic Assignment): Given a generic document d_{i}, we aim to retrieve the topic y_{j} \in {\mathcal {Y}} that maximises the likelihood P(y_{j}|d_{i}):
\begin{equation*} \forall d_{i}\; \exists!\; y_{j}\in {\mathcal {Y}} : y_{j} = \arg \max _{y \in {\mathcal {Y}}}{P(y|d_{i})} \end{equation*}

The second problem involves synthesising a document into a shorter version to optimise resource consumption during analysis. This effort becomes particularly significant for medium and long texts, as their complexity makes them more resource-intensive to process.

Problem 2 (Document Summarisation): Given a document d_{i} belonging to the topic y_{j}, with a set of sentences {\mathcal {S}}_{i}, we aim to retrieve a subset {\mathcal {S}}_{i}^* \subseteq {\mathcal {S}}_{i}, with M_{i}^* \leq M_{i} sentences, such that d_{i} still belongs to the topic y_{j}.

Fig. 1 describes the different phases of the system, which begins with the KSE module that extracts the most informative parts of the document while removing non-contributory content. The minimised document is then transformed into a numerical document representation. Finally, the document is fed into a classifier module that determines the most likely topic for the input document. Each of these phases is explained in detail in the following sections. Before delving into each component, Table 2 summarises the main parameters of our framework, from document-processing parameters to classification granularity.

TABLE 2. Main Parameters of the Proposed Framework
Figure 1. System Architecture Flow.

B. Key Sentence Extractor

The Key Sentence Extractor (KSE) process refers to a technique for synthesising a text document into a meaningful subset of sentences of great importance for the document's meaning [27]. The goal of this phase is twofold: first, it reduces the number of features in order to cut down the learner's computational effort, i.e., memory saturation and computing time; second, it removes hindering sentences, i.e., sentences with few keywords that do not contribute to determining the topic and, being insignificant, could induce the classifier to misclassify. Thus, the KSE algorithm extracts a subset of sentences {\mathcal {S}}_{i}^{*}, where M_{i}^*\leq M_{i}. The relative importance of each sentence is determined by the number of keywords it contains. Therefore, the first step of the KSE algorithm is to extract the set of keywords {\mathcal {K}}_{i} from the document d_{i}. We define the keyword density \lambda _{iy} of a sentence as the ratio between the number of keywords and the total number of words, i.e., \lambda _{iy} = Q_{iy} \big / Z_{iy}. The keyword density thus represents the relative frequency of keywords in a sentence. The KSE algorithm selects the significant sentences of the document, i.e., those sentences where \lambda _{iy}\geq \lambda _{TH}, with \lambda _{TH}\in [0,1] representing the threshold above which a sentence is considered significant. The output document d_{i}^* then includes a subset of sentences {\mathcal {S}}_{i}^* and words {\mathcal {W}}_{i}^*. However, when a demanding \lambda _{TH} is used, the new set of sentences {\mathcal {S}}_{i}^* may be empty, i.e., no sentences are selected. To handle such situations, the KSE algorithm incorporates a strategy that ensures at least one sentence is selected even if the keyword density threshold is unmet: in this case, the sentence with the maximum \lambda _{iy} is selected.
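A minimal Python sketch of this selection rule follows; it assumes sentences are already tokenised and the keyword set has been extracted (the experiments in Section V use YAKE for this), and it is illustrative rather than the authors' exact implementation.

def kse(sentences, keywords, lambda_th):
    """Keep sentences whose keyword density lambda_iy = Q_iy / Z_iy meets the
    threshold; fall back to the densest sentence if none does.
    `sentences` is a list of word lists; `keywords` is a set of keyword strings.
    Minimal sketch of the selection rule, not the authors' implementation."""
    densities = []
    for words in sentences:
        q = sum(1 for w in words if w in keywords)          # Q_iy: keywords in sentence
        densities.append(q / len(words) if words else 0.0)  # lambda_iy = Q_iy / Z_iy
    selected = [s for s, d in zip(sentences, densities) if d >= lambda_th]
    if not selected:
        # Guarantee at least one sentence survives: keep the densest one.
        selected = [max(zip(sentences, densities), key=lambda p: p[1])[0]]
    return selected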

C. Word Embedding

The embedding phase lets the system capture each word's context within the input document, provides a methodology for converting the document from natural language to a numerical representation, and allows the system to deal with synonyms by considering the semantic meaning of words. In this way, the meaning of the words is appropriately weighted, so that words with similar meanings lie closer together than words with different meanings [28].

The synthesised document d_{i}^*, which now consists of a set of {\mathcal {W}}_{i}^* words, also named word tokens, is then the input of the embedding module, which returns a matrix E_{i} \in \mathrm{I\!R}^{|{\mathcal {W}}_{i}^*| \times \rho }, where \rho corresponds to the word embedding array dimension, often decided by rule of thumb [29]. The word embedding phase can handle documents of varying sizes; however, the feature space used to represent and analyse the documents needs to be consistent in length across all documents. As a result, the embedding phase enforces a fixed-length input structure for each text corpus, which must be tuned according to the machine's resources, since it can cause memory saturation and data sparsity problems. To this end, we choose a static number of word tokens \alpha for all documents, which sets the maximum text length the framework can handle. In particular, if |{\mathcal {W}}_{i}^*|>\alpha, i.e., there are more words than can be handled, the text is truncated. It is important to note that the KSE algorithm not only helps to reduce data volume but also mitigates the information loss due to text truncation, since only important sentences are used as input to this phase, so that |{\mathcal {W}}_{i}^*|\leq |{\mathcal {W}}_{i}|. In contrast, if |{\mathcal {W}}_{i}^*| \leq \alpha, the resulting matrix is filled with zero padding of size [(\alpha -|{\mathcal {W}}_{i}^*|) \times \rho ] so that all documents have the same dimensions. The matrix E_{i} is represented in (1) as a transposed row vector whose components are the embedded tokens, where \mathbf {e}_{i\sigma } is the vector representing the generic word and |\mathbf {e}_{i\sigma }| = \rho.
\begin{equation*} E_{i} = \begin{bmatrix}\mathbf {e}_{i1} & \mathbf {e}_{i2} & \dots & \mathbf {e}_{i\sigma } & \dots & \mathbf {e}_{i\alpha } \end{bmatrix}^\intercal \tag{1} \end{equation*}
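The truncation-or-padding step can be sketched in NumPy as follows; the function name and structure are ours, shown only to make the fixed \alpha \times \rho input shape concrete.

import numpy as np

def fix_length(E, alpha):
    """Truncate or zero-pad an embedding matrix E (|W*| x rho) to alpha x rho,
    mirroring the fixed-length input described above (illustrative sketch)."""
    n, rho = E.shape
    if n > alpha:
        return E[:alpha]                      # truncate: more tokens than handled
    pad = np.zeros((alpha - n, rho))          # padding block of size (alpha - n) x rho
    return np.vstack([E, pad])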

D. Topic Classifier

With a supervised approach, the proposed topic classifier aims to assign one of J predefined topics to a given document. To achieve this, we employ DL techniques to learn the correlation between the input document, represented in matrix form, and the corresponding output topic. As depicted in Fig. 2, the architecture consists of three blocks connected in succession: a BiLSTM and a self-attention layer, followed by a fully connected layer. The input document is represented as an embedded matrix \alpha \times \rho, where \alpha is the document length and \rho is the embedding dimension. The BiLSTM layer comprises two stages, a forward and a backward LSTM, which enhance the comprehension of the framework by allowing it to account for the embeddings of the words that come before and after a specific word in a sentence. We aim to retrieve the two components of the hidden state \lbrace \overrightarrow{h_{i\sigma }},\overleftarrow{h_{i\sigma }}\rbrace, corresponding to the forward and backward hidden states, expressed as:
\begin{align*} \overrightarrow{h_{i\sigma }} &= f \left(\overrightarrow{\boldsymbol{\omega }}_{xh}\mathbf {e}_{i\sigma } + \overrightarrow{\boldsymbol{\omega }}_{hh}\overrightarrow{h}_{i\sigma -1} + \overrightarrow{b}_{h}\right) \tag{2} \\ \overleftarrow{h_{i\sigma }} &= f \left(\overleftarrow{\boldsymbol{\omega }}_{xh}\mathbf {e}_{i\sigma } + \overleftarrow{\boldsymbol{\omega }}_{hh}\overleftarrow{h}_{i\sigma -1} + \overleftarrow{b}_{h}\right) \tag{3} \end{align*}
where \boldsymbol{\omega }_{xh} is the weight matrix connecting the input layer and the hidden layer, \boldsymbol{\omega }_{hh} is the weight matrix between two consecutive hidden states, and b_{h} and f are the bias vector of the hidden layer and the non-linear activation function, respectively. The output of the BiLSTM is obtained by concatenating the aforementioned hidden states as h_{i\sigma }=[\overrightarrow{h_{i\sigma }},\overleftarrow{h_{i\sigma }}], which is the input to the self-attention layer. The self-attention layer is a key component of our proposed architecture, as it applies a self-attention mechanism to calculate a weighted sum of the BiLSTM outputs. This allows the framework to selectively focus on specific parts of the input sequence, weighing the importance of different tokens and improving its ability to make predictions. This layer is based on dot-product attention, where the attention weight \delta _{\sigma \tau } indicates how much the \sigma-th token depends on a generic token \tau in terms of semantic similarity, and is computed as:
\begin{align*} \delta _{\sigma \tau } = \frac{\exp (h_{i\sigma }^{\intercal } h_{i\tau })}{\sum _{\theta =1}^{\alpha } \exp (h_{i\sigma }^{\intercal } h_{i\theta })} \tag{4} \end{align*}
where h_{i\sigma }^{\intercal } h_{i\tau } is the dot product between the \sigma-th and \tau-th tokens' BiLSTM outputs, and \intercal denotes the transpose operation. The dot product measures how much information is shared between the two tokens. The attention weight \delta _{\sigma \tau } is then obtained by normalising the dot products with a softmax function, which ensures that the attention weights of each token sum to one.
The self-attention layer then uses the attention weights to compute a new representation for each token, which incorporates information from both directions. The new representation for the \sigma-th token is calculated as a weighted average of all tokens' BiLSTM outputs, using the attention weights as coefficients:
\begin{equation*} o_{i\sigma } = \sum _{\tau =1}^{\alpha } \delta _{\sigma \tau } h_{i\tau } \tag{5} \end{equation*}
where o_{i\sigma } is the new representation for the \sigma-th token in the embedding matrix E_{i}, and \alpha is the number of tokens in E_{i}. The new embedding matrix is finally passed to a dense layer, which learns nonlinear relationships between input and output data by applying weights and biases to the inputs and passing them through an activation function. The most likely topic is then computed by a softmax function that generates a distribution over the class labels.
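For clarity, (4) and (5) amount to a row-wise softmax over pairwise dot products followed by a weighted sum, as the following NumPy sketch shows (illustrative only):

import numpy as np

def dot_product_self_attention(H):
    """H: (alpha, 2*hidden) matrix of BiLSTM outputs h_{i,sigma}.
    Returns O, where row sigma is the attention-weighted sum of all h_{i,tau},
    as in (4)-(5). Illustrative sketch only."""
    scores = H @ H.T                              # pairwise dot products h_sigma . h_tau
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability for softmax
    delta = np.exp(scores)
    delta /= delta.sum(axis=1, keepdims=True)     # softmax over tau: each row sums to one (4)
    return delta @ H                              # o_sigma = sum_tau delta_{sigma,tau} h_tau (5)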

Figure 2. Proposed architecture with BiLSTM, Attention Mechanism, and a fully connected layer.

SECTION IV.

Proposed Dataset

The limited availability and diversity of datasets in current research is a persistent issue that often leads to generalisation problems. Existing datasets tend to cover only a few topics and are restricted to a limited number of news sources. To address this gap, we propose a new dataset that offers a broader range of topics and sources, aiming to enhance the generalisability of research findings. This section introduces the proposed dataset's creation and description, which are detailed in the following subsections.

A. Dataset Creation and Description

Our proposed dataset was created by carefully selecting news articles from various regions, including America, Europe, and Australia. This broad geographical coverage ensures a diverse range of writing styles, expressions, and perspectives from multiple authors. To ensure the dataset remains up-to-date and relevant, we added more recent news articles incorporating the latest terms and events, including the global pandemic, “COVID”. The dataset creation process began with the U.K. assessment described in [30], highlighting 12 potential topics as the most popular among readers and sharers. We excluded topics with strict geographical connections and separated the science and technology categories. This refinement resulted in a final list of 10 output categories. Using Web scraping tools, we extracted 60,000 news articles covering the selected topics, with an average of 6,000 news articles per topic to maintain balance in the dataset. The news articles were scraped from distinct news publishers to ensure a good heterogeneity of authors' writing styles and word usage across different geographical zones.

Table 3 displays the diverse sources utilised in the dataset, listing each source's name, country, and the range of topics procured from it. Spanning articles from 2022 to 2023, the dataset captures a snapshot of contemporary news stories. The dataset has been meticulously curated to achieve a good balance among the sources. The variation in representation between the majority and minority classes is 0.17, reflecting the degree of balance across the dataset. This low level of disparity contributes to the dataset's robustness and reliability for developing ML models that require balanced data inputs. The dataset has been made available for testing purposes.1

TABLE 3. List of Sources Employed to Construct the Dataset
SECTION V.

Experimental Setup

To evaluate the effectiveness of our proposed solution, we utilised standard news article datasets commonly employed in topic classification research. In addition to the novel dataset presented in Section IV, we describe the structures of the two other datasets used in our experiments. This allows for a comprehensive comparison of our approach with existing methods.

A. Dataset Analysis

In order to test the proposed solution, we employed standard news datasets utilised in topic classification research. This subsection describes the structures of the employed datasets:

AG News [31]. This dataset contains 127,568 news articles: 120,000 for training and 7,568 for testing. The dataset is structured in comma-separated values (CSV) format with three attributes: title, description (which corresponds to the news corpus), and the topic index, for a total of 4 output categories, i.e., world, sports, business, and science-technology.

BBC News Dataset [32]. A set of news articles from the BBC containing 2,225 articles divided into 5 topics, i.e., business, entertainment, politics, sport, and tech.

To evaluate our approach rigorously, we analysed four widely used datasets: AG News [31], BBC News [32], Reuters [33], and 20NewsGroups [34]. Our analysis focused on two key complexity metrics: interclass similarity and the Flesch-Kincaid readability score. Interclass similarity measures the semantic overlap between categories within a dataset, calculated using TF-IDF vectors and cosine similarity between the aggregated category documents [35]. Higher values indicate more challenging classification tasks, as the boundaries between categories become less distinct. The Flesch-Kincaid score assesses text complexity based on sentence length and word difficulty, with higher scores indicating more complex text [36]. The analysis results, shown in Table 4, reveal that AG News and BBC News present substantial complexity: AG News shows an interclass similarity of 0.539 and a Flesch-Kincaid score of 15.06, while BBC News exhibits an interclass similarity of 0.466 and a Flesch-Kincaid score of 11.30. These values are comparable to or exceed those of larger datasets, Reuters (0.143 similarity, 11.27 Flesch-Kincaid) and 20NewsGroups (0.259 similarity, 9.00 Flesch-Kincaid). This suggests that, despite having fewer categories, AG News and BBC News provide sufficiently challenging test cases for evaluating news topic classification systems. In this context, the proposed GN60K dataset introduces additional complexity with an interclass similarity of 0.564, the highest among all analysed datasets. With its intermediate number of categories (10), GN60K provides a balanced benchmark between category granularity and classification difficulty. It is worth noting that dataset complexity should also be evaluated in relation to the specific tasks being tested. In our case, beyond classification accuracy, we aim to validate the effectiveness of our KSE algorithm, making document length a crucial factor. The selected datasets offer a comprehensive spectrum of text lengths, from short to long, enabling a thorough evaluation of KSE's ability to reduce data volume while maintaining classification accuracy across different document scales.
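A sketch of how the interclass-similarity metric can be computed (TF-IDF over per-category aggregated documents plus cosine similarity, as described above) is given below; the aggregation into "super-documents" and the averaging over category pairs are our assumptions where [35] is not explicit.

# Illustrative sketch of the interclass-similarity metric (assumptions noted above).
import itertools
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def interclass_similarity(docs, labels):
    """Merge all documents of each category into one aggregated document,
    vectorise with TF-IDF, and average the pairwise cosine similarities."""
    categories = sorted(set(labels))
    merged = [" ".join(d for d, l in zip(docs, labels) if l == c) for c in categories]
    X = TfidfVectorizer().fit_transform(merged)
    sims = cosine_similarity(X)
    pairs = list(itertools.combinations(range(len(categories)), 2))
    return sum(sims[i, j] for i, j in pairs) / len(pairs)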

TABLE 4. Comparison of Dataset Characteristics

B. Simulation Setup

A pre-processing phase was conducted on the three selected datasets. In particular, each news article was cleaned of special characters and stop words, which do not significantly impact topic selection. Moreover, we focused on the morphological structure of words, considering inflectional variations such as the transition from singular to plural forms or the conjugation of verbs in different tenses. For this purpose, we used known techniques such as stemming and lemmatisation to alter the text corpus and unify different terms with similar semantic meanings. For instance, the word “scientists” is transformed into “scientist”, and the verb “researched” is converted into “research”. Afterwards, the KSE algorithm was applied, varying the parameter \lambda _{TH} for each dataset. Regarding keyword extraction, YAKE [37] was adopted due to its wide diffusion in the context of online news articles [38].
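A minimal sketch of this pre-processing pipeline, assuming the nltk and yake packages, could look as follows; the specific YAKE parameters are illustrative choices, not those of the article.

# Illustrative pre-processing of one news article (parameter choices are ours).
import re
import nltk
import yake
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

def preprocess(text):
    text = re.sub(r"[^A-Za-z\s]", " ", text).lower()   # strip special characters
    stops = set(stopwords.words("english"))
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(t) for t in text.split() if t not in stops]
    return " ".join(tokens)                            # e.g., "scientists" -> "scientist"

article_text = "Scientists researched new vaccines as the pandemic evolved."
# Keyword extraction with YAKE (lower score means a more relevant keyword).
extractor = yake.KeywordExtractor(lan="en", n=1, top=20)
keywords = {kw for kw, score in extractor.extract_keywords(preprocess(article_text))}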

Furthermore, the threshold \lambda _{TH} was analysed during the tests and is discussed in Section VI in order to find the best trade-off between system complexity and accuracy. Afterwards, BERT [39] was employed to convert the text to vectors of \rho =768 components, corresponding to the original implementation of the model. In particular, to avoid memory saturation problems, each document was truncated by setting the parameter \alpha equal to the average length of the news articles in the training set, as suggested in [40]. The output of the embedding system, i.e., the matrix E_{i} related to the synthesised document d^*_{i}, is fed to the BiLSTM layer to capture language dependencies both before and after each token. The output is then used as input to a dense layer of 512 neurons with a rectified linear unit (ReLU) activation function. Finally, a softmax function converts the learned values into the range between 0 and 1 used to calculate the probability of each class. This stage employs a number of neurons equal to J, the number of output classes of the dataset, i.e., J= 4 for the AG News dataset, J=5 for the BBC dataset, and J=10 for our GN60K dataset.
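Under these settings, the classifier head can be sketched in Keras as follows; the BiLSTM hidden size and the pooling of token representations before the dense layer are our assumptions, as the article does not state them.

# Illustrative Keras sketch of the classifier (hidden size and pooling assumed).
import tensorflow as tf
from tensorflow.keras import layers

ALPHA = 128   # max token length; the article sets it to the training set's average length
RHO = 768     # BERT embedding dimension
J = 10        # number of output classes, e.g., 10 for GN60K

inputs = layers.Input(shape=(ALPHA, RHO))                                   # embedded document E_i
h = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(inputs)  # hidden size assumed
o = layers.Attention()([h, h])               # dot-product self-attention over tokens
o = layers.GlobalAveragePooling1D()(o)       # pool token representations (pooling choice assumed)
o = layers.Dense(512, activation="relu")(o)  # dense layer with 512 neurons and ReLU
outputs = layers.Dense(J, activation="softmax")(o)  # class probability distribution

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])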

SECTION VI.

Simulation Results

We evaluate the proposed system by analysing its performance on the three identified datasets. The experiments are conducted on a computer equipped with an Intel Core(TM) i9-7980XE CPU @ 2.60 GHz, 64 GB of RAM, and an Nvidia GeForce GTX 1080Ti graphics card with 11 GB of memory. In addition to the hardware setup, we employed the TensorFlow library and the Hugging Face Transformers library. We study how the KSE algorithm affects and enhances system performance. Finally, we compare the proposed method with state-of-the-art topic classification methods to assess the effectiveness of the proposed architecture.

A. Classification Results

We employed standard metrics, including precision, recall, F1-score, and accuracy, to evaluate the system's effectiveness in classifying news articles into single topics. Our work mainly focuses on the accuracy metric for several reasons. First, related works often report accuracy measures, allowing us to benchmark our model and compare it with other related works. Second, accuracy provides a direct and clear indicator of the model's ability to identify the topic, particularly for single-label topic classification tasks.

For each dataset, we compute the mentioned metrics for each topic using the KSE algorithm with the best \lambda _{TH} value. The classification results for the three selected datasets are presented in Table 5. The BBC News outcomes are obtained on the test set of 735 news articles. The best accuracy, 0.997, is reached using \lambda _{TH}=0.15, with well-balanced Precision, Recall, and F1-Score values. These results were expected: although not extensive in size, the dataset comprises news articles from a single source, which may result in less variability in writing style and vocabulary usage than datasets encompassing articles from multiple sources. The classification results for the GN60K dataset refer to a random split of the dataset into training and testing sets at an 80/20 ratio. The Precision, Recall, and F1-Score values for each category demonstrate the effectiveness of the KSE algorithm in accurately classifying news articles, achieving a remarkable accuracy of 0.945. The Science category achieved the highest scores for all three metrics, with values of 0.986, 0.981, and 0.984 for Precision, Recall, and F1-Score, respectively. This indicates that the framework excels at identifying and classifying science-related news articles, owing to the distinct set of keywords and terminology associated with this topic. In contrast, the Politics category had the lowest Recall value of 0.838, suggesting that the framework had difficulty detecting political news articles due to their wide range of themes and subtopics. Finally, the classification outcomes for the AG News dataset are obtained on the standard test set of 7,568 news articles. The average accuracy score is 0.9455, with the Business category showing the lowest scores among the four categories, consistent with the results observed in the GN60K dataset. In contrast, the Sports category achieved the highest Precision, Recall, and F1-Score values, with an Accuracy score of 0.9904. The Sci/Tech category also shows weaker results, indicating that merging Tech and Science into a single category may have led to less favourable outcomes. This is consistent with the results obtained on the GN60K dataset, where the Tech class did not excel.

TABLE 5. Classification Results

B. Efficiency of the KSE Algorithm: Accuracy and Training Time Cost

This study aims to assess the KSE algorithm's effectiveness in reducing training and testing time costs while improving classification accuracy. To achieve this, we analysed the distribution of sentence percentages in the three selected datasets as a function of the \lambda parameter. For each news item, we extracted the relevant keywords and calculated the \lambda value of each sentence. We then generated the histogram in Fig. 3, depicting the distribution of sentences over their \lambda values using a quantisation step of 0.05. To enable a direct comparison between the three datasets, the height of the bars is normalised by the total number of sentences in each dataset.

Figure 3. Sentence percentage distribution.
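The per-sentence \lambda distribution of Fig. 3 can be reproduced along these lines; `sentences` and `keywords_per_sentence` are placeholders for a dataset's tokenised sentences and its per-sentence keyword sets, and the plot styling is ours.

# Illustrative sketch of the normalised lambda histogram (0.05 quantisation step).
import numpy as np
import matplotlib.pyplot as plt

def lambda_values(sentences, keywords_per_sentence):
    """Per-sentence keyword densities lambda_iy; see the KSE sketch above."""
    return np.array([
        sum(w in kws for w in words) / len(words)
        for words, kws in zip(sentences, keywords_per_sentence) if words
    ])

lams = lambda_values(sentences, keywords_per_sentence)   # placeholders for one dataset
bins = np.arange(0.0, 1.05, 0.05)
plt.hist(lams, bins=bins, weights=np.ones_like(lams) / len(lams))  # normalised heights
plt.xlabel(r"$\lambda$")
plt.ylabel("fraction of sentences")
plt.show()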

The analysis of the BBC News and GN60K datasets reveals a higher density of sentences with \lambda values near zero, i.e., most sentences contain only a few keywords. In contrast, the AG News dataset exhibits a distinct distribution: due to the shortness of its news articles, the distribution of \lambda values peaks at higher values, with a maximum at \lambda =0.8, highlighting that 80% of the words are considered keywords.

Fig. 4 displays the accuracy obtained on each dataset with different values of \lambda _{TH} (black solid lines). The remaining curves (dashed lines) depict the training and testing times on a semi-logarithmic scale. For the BBC News and GN60K datasets, the optimal \lambda _{TH} values fall within the 0.1 to 0.15 range, where the highest accuracy values are achieved. Within this range, accuracy consistently exceeds the value obtained without the KSE algorithm (i.e., for \lambda _{TH} = 0, where the original dataset is used), indicating that KSE effectively eliminates non-contributory sentences from news articles. However, accuracy decreases for higher threshold values, as the KSE algorithm starts eliminating important sentences. Indeed, for \lambda _{TH}=0.25, compared to \lambda _{TH} = 0, the remaining informative content is approximately 36% and 28% for GN60K and BBC News, respectively.

Figure 4. Accuracy and Efficiency Trade-off.

Moreover, we observe a significant improvement in training and testing times for GN60K and BBC News as \lambda _{TH} increases. Selecting a threshold value \lambda _{TH} \in (0.1, 0.15) leads to a 34% improvement in training time and a 66% improvement in testing time for GN60K, while the same threshold range yields an 82% improvement in training time and a 63% improvement in testing time for the BBC News dataset. It is important to note that the testing times for the three datasets may differ due to variations in news sizes (e.g., long, medium, and short). While these variations exist, our primary objective is to demonstrate how the KSE algorithm effectively reduces computational overhead for longer documents. To ensure a rigorous evaluation of KSE's benefits, we conducted our testing using an identical number of articles (2,000) across all datasets. This standardised testing approach allows us to quantify the computational advantages of KSE while maintaining experimental consistency. Concerning the AG News dataset, the news articles are generally shorter than those in the other datasets. This results in a higher concentration of informative sentences at increasing \lambda _{TH} values, as illustrated in Fig. 3. However, this can also make it more difficult for the KSE algorithm to distinguish between informative and non-informative sentences, reducing the effectiveness of the threshold parameter in eliminating non-contributory information. The performance of the proposed system is almost the same regardless of the chosen \lambda _{TH} value and is comparable with state-of-the-art approaches. Given the short news length, the testing time remains comparable to that of the other two datasets, indicating that for short news articles there is no need to reduce the feature space, and it is better to analyse the news in its entirety. This finding underscores a key strength of our framework: its ability to adapt its processing strategy to document length. For short texts like those in AG News, the framework effectively preserves the entire content, recognising that dimensionality reduction would not provide meaningful benefits. This adaptive behaviour ensures good performance across varying document lengths while maintaining computational efficiency. Finally, the model is tested in terms of inference time. In this sense, it is important that the inference time is less than the article reading time, enabling the model to classify the text even before the reader finishes it. Specifically, the inference times for all the document types described in Section II, namely short (S) documents containing an average of 20 words, medium (M) documents with 60 words, and long (L) documents with an average of 200 words, are significantly lower than the average reading time. Considering that the average silent reading speed of an adult is about 230 words per minute [41], the average inference time of our proposal, approximately 0.011 ms, is far less than the 5.22 s, 15.65 s, and 52.17 s required to read S, M, and L documents, respectively.
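The quoted reading times follow directly from the 230 words-per-minute figure, as this one-line check shows:

# Reading-time figures quoted above: words / (230 words per minute) * 60 s.
for label, words in [("S", 20), ("M", 60), ("L", 200)]:
    print(f"{label}: {words / 230 * 60:.2f} s")   # -> 5.22 s, 15.65 s, 52.17 s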

C. Comparison With Existing Works

In order to provide a fair and accurate comparison, we have selected studies that use the same portions of the BBC News and AG News datasets as baselines, ensuring comparable and consistent data. Therefore, this section compares our approach with the best models presented in the literature. The results of our proposed method and the comparison works are presented in Table 6. The tests are performed using the same datasets and evaluation metrics, with a primary focus on comparing accuracy, the central metric of this proposal. In this sense, the previous subsection demonstrated that the training and inference times do not impact the user's reading of an article; therefore, these factors do not influence the topic classification of the document.

TABLE 6. Works Comparison on BBC News and AG News

In the first study, we compare the three most promising works in the literature that employed the BBC News dataset for topic classification. Kumar et al. [15] achieved an accuracy of 99.1% using RoBERTa with transfer learning. Shah et al. [16] utilised the BERT transformer to classify English text, achieving an accuracy of 89.1%. Kavitha et al. [17] proposed a CNN and an LSTM, achieving an impressive accuracy of 99.97% on BBC News and 87.76% on AG News. While these studies demonstrate competitive accuracy rates, our framework is able to outperform them. We calculated the accuracy across all sets and topics, obtaining an overall accuracy rate of 99.7%.

Our second study focuses on the AG News dataset, which has been used to evaluate various models in the literature. We compare the three works that achieved the highest accuracy rates on this dataset. Kumar Velu et al. [19] explored a supervised approach, achieving an accuracy of 92.67%, while Waly et al. [24] obtained an accuracy of 89.91% utilising different embedding models and classifiers. Our framework outperforms all these works with an accuracy of 94.55%, with the highest performance achieved in the sports class at 97.2%. These results demonstrate that our work is effective in classifying news of different text lengths and in extracting contextual information.

D. Comparison With Large Language Models

Given the high performance of LLMs in various NLP tasks, we extended our experiments to analyse whether they can surpass the performance of our model in topic classification. To this end, we conducted a comparative analysis between our system and TinyLlama,2 a model that is smaller in scale than the leading models in the field. Our study concentrated on a use case involving the detection of fake news followed by a search for its factual counterpart based on the topic. We ran both models on an identical computational setup to ensure a fair comparison, thereby providing equivalent processing power. Our results showed that TinyLlama achieved an average inference time of 45 seconds per news article, with 82% accuracy on the test portion of the BBC dataset. This suggests that our solution is more adept at handling the aforementioned use case. To enrich our study further, we also tested GPT-3.5. This step was taken to evaluate the performance of a larger and more advanced LLM, setting aside our computational limitations and focusing exclusively on accuracy. To evaluate our system's and ChatGPT's performance in classifying news topics, we used the same BBC News test split employed for TinyLlama. Using the OpenAI API, we asked ChatGPT to classify each news item in the dataset into one of the five topics. Our system achieved an accuracy of 99.7%, while ChatGPT obtained 92%. Table 7 provides a comprehensive comparison of performance metrics between our framework and the tested LLMs. These results suggest that our system is more suitable for topic classification tasks than LLMs, which are more general-purpose. We acknowledge that LLMs' capabilities extend beyond topic classification; however, our study focused on comparing these models' performance against our system on this specific task.

TABLE 7. Performance Comparison With Large Language Models
SECTION VII.

Conclusion

This article presents a new topic classification model for news articles and a new dataset of 60,000 news articles from different sources all over the world to improve generalisation.

The proposed system implements a key sentence extractor mechanism that decreases the data dimensionality without impacting the model's accuracy. Moreover, the system is able to correctly classify the topic of news of different text lengths. In particular, the system has been tested on three different datasets and compared with state-of-the-art models. Experimental results prove that the implemented system outperforms well-known models and that the proposed key sentence extractor significantly reduces the testing time for medium and long texts.

NOTE

Open Access funding provided by ‘Università degli Studi di Cagliari’ within the CRUI CARE Agreement

This article includes a dataset hosted on IEEE DataPort(TM), a data repository created by IEEE to facilitate research reproducibility.
Dataset Name: Global News 60K (GN60K)
