Semi-Automatic Classification and Duplicate Detection From Human Loss News Corpus

Automatic news repository collection systems involve a news crawler that extracts news from different news portals; these news articles then need to be processed to determine the category of each article, e.g. sports, politics, showbiz, etc. This process poses two main challenges: the first is to place a news article under the right category, while the second is to detect duplicate news. When news is extracted from multiple sources, it is highly probable that the same story will be obtained from several portals, resulting in duplicate news; failing to detect these duplicates leads to inconsistent statistics after pre-processing the news text. This problem becomes more pertinent when we deal with human loss news involving crime-, accident-, and similar related articles, as the system may count the same news many times and produce misleading statistics. To address these problems, this research presents the following contributions. Firstly, a news corpus comprising human loss news of different categories has been developed by gathering data from several well-known and authentic news websites; the corpus also includes a number of duplicate news articles. Secondly, a comparison of different classification approaches has been conducted to empirically find the text classifier best suited to categorizing the sub-categories of human loss news. Lastly, methods have been proposed and compared to detect duplicate news from the corpus, involving different pre-processing techniques and two widely used similarity measures, cosine similarity and Jaccard's coefficient. The results show that conventional text classifiers are still relevant and perform well in text classification tasks, with MNB achieving 89.5% accuracy, while the Jaccard coefficient exhibits much better results than cosine similarity for duplicate news detection across the pre-processing variations, with an average accuracy of 83.16%.


I. INTRODUCTION
News similarity computation aims to check the similarity of two news stories with each other, relying mainly on the text of the stories. Data on social media, such as social networking websites and media news websites, is increasing at a rapid speed in the present era. Meanwhile, in many big cities of the world crime control is a serious issue, and there are agencies which compute safety and security indices for different cities [1]. These indices are built using different useful statistics, whereby news reports play an important role in the collection of events disturbing the law-and-order situation of a city or in detecting a crime pocket in a city. In this scenario, when news stories are extracted from different news portals, there is a chance that one story may come from multiple sources. Such articles may be referred to as same-day duplicate news articles coming from multiple sources. Figure 1 presents a snapshot of such duplicate news stories published on different portals.
On the other hand, some news articles may be more severe or have a major impact on society in terms of human loss. We generally find follow-ups of such news stories, and it is a challenge to distinguish them as follow-up news. Figure 2 shows the screenshots of a news story with its two follow-up stories on different news portals on different dates. Figure 2(a) is a screenshot of a news report on a terrorist bomb attack on a mosque during Friday prayers on 19 August 2011; according to this article, crawled from the Dawn news website, 47 people were killed and 70 injured. Figure 2(b) is a screenshot of a follow-up to Figure 2(a), taken on 23 August 2011 and crawled from the CNN website; according to CNN, the number of dead was 50 and the number of injured 117. The third screenshot, shown in Figure 2(c), is a follow-up taken from the Express Tribune website on 29 August 2011; according to this article, the number of dead was 56 and the number of injured 123. It can be observed from these screenshots that the reported numbers of dead and injured changed across the follow-up news, even though the stories all relate to the same incident: the bomb attack on a mosque in Jamrud, Pakistan, during Friday prayers on 19 August 2011.
Such news stories, including accident-, crime-, terrorism-, or natural disaster-related news, are directly related to human loss information, and they may help in building a security index for a city. This index may be a number calculated from the count of incidents directly or indirectly related to human security that occurred at a given place. The above examples show the importance of detecting duplicate news properly; otherwise a single event may be counted multiple times. If we treat all such articles as distinct events and do not maintain correct statistics of the news, the value of the security index will be exaggerated and the reliability of the system will drop. It is therefore very important to detect duplicate news articles correctly and to maintain correct statistics of these detected duplicates.
Similarly, different news stories belong to different categories, and each category may impact the security index differently. Thus it is pertinent to place news reports in the appropriate sub-category of human loss news.
The main contributions of this research article are as follows. Firstly, a news corpus has been generated using a systematic approach; it comprises different categories of human loss related news stories containing both types of duplicate entries discussed earlier. The collected news articles are not only tagged as duplicates but are also classified into the appropriate sub-category of human loss news. Secondly, an empirical comparison of widely used conventional text classification approaches has been performed on the collected news to identify the best classifier for this purpose. Lastly, the best approach to detect duplicate news stories has been identified by applying different variants of pre-processing steps and similarity measures.
The rest of this article is organized as follows: Section II gives a brief review of related research using different algorithms and techniques for pre-processing, news classification, and news similarity. Section III details data collection and corpus generation. Section IV describes the methodology, covering the classifiers used for classification in this article and the text similarity models with pre-processing techniques. Section V presents the experimental setup and results. Lastly, the article is concluded with a discussion in Section VI.

II. RELATED WORK
String-based and semantic-based similarity of text documents has been widely studied. These approaches are also used for news processing in order to extract important and useful information from bulk news documents. Many researchers have worked for years to find efficient and accurate criteria for semantic/lexical similarity between two words or two text documents. In text similarity, pre-processing is a major component, because text coming from news sources may contain a lot of irrelevant and useless information which can slow down any text similarity system. The related work is therefore divided into three parts: text pre-processing, news classification, and news text similarity.

A. PRE-PROCESSING
The most important steps in pre-processing for text similarity are the removal of HTML tags, diacritic words, and stop words, and stemming of the text, so that the computation can be made faster. Stop words are the terms with the highest term frequency (TF) in a text document; they are of little use because they carry no semantic knowledge about the text and cannot help assess the accuracy of a similarity computation model. It has been determined that a list of stop words can be generated automatically using Zipf's law, but the process is very expensive in terms of time and memory consumption [2], so a stop word list created dynamically at runtime is impractical. The work in this paper therefore uses a pre-built list of stop words.
Many words in a text share the same grammatical root; for example, loved, lovely, and loveable are all generated from the single word love. Reducing words to their root is called stemming. Some researchers have improved the space and time complexity of Porter's stemming algorithm by applying constraints to it, and this variant is used here for stemming [3]. Similar pre-processing has also been applied to Arabic news texts in two major steps: first, observation of the dataset, and second, removal of stop words from the dataset [4].

B. NEWS CLASSIFICATION AND SIMILARITY
Text similarity is computed over a number of features extracted from the text. Two approaches were proposed to reduce the number of features of Arabic news text: filtering of N-gram features and system combination. In that research, three SVM-based classification systems were used and their results were compared with a conventional system based on cosine similarity, without reducing the performance of the systems [5]. Two empirical methods have also been proposed, namely per-document text normalization and feature weighting, in which weights are assigned to features to address problems in the naïve Bayes parameter estimation process [6].
Similarity measures for lexical, probabilistic, and hybrid methods have been compared on short text phrases; the results showed that the lexical measure is good for semantic matching, while the probabilistic method is better for finding topics of interest related to the matches [7].
Semantic and string-based similarity using the longest common subsequence (LCS) of text with a corpus-based measure was proposed by Islam [8]. Other researchers presented a novel approach achieving good results for an automated short-answer grading system; that research also examined the effect of corpus size and domain on the usefulness of corpus-based measures [9].
It has been shown that computing text similarity on the basis of a key sentence extracted from a document is feasible, because the sense of the whole document is present in the key sentence [10]. The strengths and weaknesses of various text classification algorithms, such as the Naïve Bayes classifier, decision trees, Support Vector Machines (SVM), fuzzy logic, and k-nearest neighbors (kNN), have also been compared by testing these classifiers on email filtering for spam detection [11].
A discriminative approach has been presented for learning a projection matrix, in which a low-dimensional space is built from raw term vectors; the researchers tried to optimize the cosine similarity model to make it reliable [12]. Other researchers presented work on news recommendation based on news content similarity [13], attempting to derive users' interests from the target news.
A method to detect duplicate web documents in a corpus using syntactical approaches was proposed, which used shingles of words, web URLs, and semantic information [14].
Matias Fernando also presented research on text classification for use in text summarization, covering four different techniques: a standard classification technique, classifier generation using secondary data, a semi-automated rule summarization extraction tool, and a hierarchical summarization classification approach [15]. Other researchers improved the traditional text segmentation method by combining VSM with SM, applying it to the similarity of Chinese news texts [16]. One study compared various algorithms such as Naïve Bayes, Support Vector Machine (SVM), and k-Nearest Neighbor (kNN), concluding that kNN was best for that dataset; the study also surveyed numerous methods to decrease processing time [17].
The Naïve Bayes algorithm, the sequential minimal optimization algorithm, k-nearest neighbors, and the Jaccard coefficient were compared for the classification of Arabic news texts, and the sequential minimal optimization algorithm was found to be the better training model and classifier for Arabic news [4]. Some researchers used a probabilistic approach for classifying news articles on the basis of their headlines: they broke the headlines down into a bag of words, assigned appropriate probabilistic weights to them, and finally compared these words with others to obtain the similarity [18]. In one study, news was classified into cancer, crime, health, and terror categories on the basis of content; the results showed 60%, 80%, 40%, and 80% accuracy, respectively, for the proposed methodology. Another study computed text similarity by extracting document features, proposing three categories: in the first, two documents share the same features; in the second, only one document has the feature; and in the third, neither document has it. According to this approach, the similarity value increases as the number of present-absent feature pairs decreases [19].
Researchers conducted a case study on news-related tweets about the Brussels bombing in 2016, observing user behavior and extracting sense-making information from the context of tweeted and retweeted text [20].
Semi-supervised machine learning classifiers [32] have also been used to classify news with respect to geographical information. The authors used geoparser.io to extract geographical information from news text; geoparser.io is an API that identifies location names in text and returns a JSON object with a unique id for each location on the map [1]. Other researchers worked on educational news articles, computing the similarity between news documents to separate them into positive and negative educational news on the basis of the emotions used in the text; for this they built a sentiment dictionary of frequently used emotion words using the Laplace-smoothed SO-PMI algorithm [21]. Some researchers classified news into politics, weather, business, terrorism, showbiz, sports, etc. on the basis of headlines; according to them, their proposed probabilistic method is 92.41% accurate [22].
The gap found in the above-mentioned related research is that, while data on multimedia websites is enriched with news on a daily basis, an automated crawler may read a specific news story from various sources, resulting in exaggerated statistics about event counts and related information. In this research, duplicate news articles coming from multiple sources are identified by applying two well-known similarity algorithms, cosine similarity and the Jaccard coefficient.

III. CORPUS GENERATION
The primary dataset of news articles was collected from different well-known national and international news websites through a web crawler written in Java (using the NetBeans IDE). The crawler first visits the base URL of a news website, then visits only those news URLs which contain some human loss information, and saves the text of each crawled news story in a .txt document; a minimal sketch of this logic is given below. The websites and their respective URLs from which the news was crawled are given in Table 1. To the best of our knowledge, this is the first news corpus that includes annotations for news categories as well as annotations for duplicate and follow-up news articles.
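The original crawler is written in Java; purely as an illustration of the described logic (visit the base URL, follow article links, keep only human loss stories, save as .txt), a minimal Python sketch using requests and BeautifulSoup might look as follows. The keyword filter is an assumption, since the paper does not state how human loss news is recognized at crawl time.

```python
import os
import re
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

# Hypothetical keyword filter standing in for the paper's human loss check
LOSS_KEYWORDS = re.compile(r"\b(killed|dead|died|injured|wounded)\b", re.I)

def crawl_portal(base_url, out_dir="corpus"):
    os.makedirs(out_dir, exist_ok=True)
    # Visit the base URL of the news website and collect article links
    soup = BeautifulSoup(requests.get(base_url).text, "html.parser")
    for i, link in enumerate(soup.find_all("a", href=True)):
        url = urljoin(base_url, link["href"])
        try:
            article = BeautifulSoup(requests.get(url).text, "html.parser")
        except requests.RequestException:
            continue
        text = article.get_text(" ", strip=True)
        # Keep only articles carrying human loss information
        if LOSS_KEYWORDS.search(text):
            with open(os.path.join(out_dir, f"news_{i}.txt"), "w",
                      encoding="utf-8") as f:
                f.write(text)
```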

A. CATEGORY WISE STATISTICS OF CORPUS
Approximately 5,000 authentic news articles were crawled from the above-mentioned websites to test and compare the two similarity approaches. Since the main story of a news report is generally present in its first paragraph, each crawled news article consists of roughly the first 100 to 200 words of the published article.
The whole dataset consists of news articles belonging to four different classes: accident, crime, terrorism, and natural disaster. All of these articles contain human loss information, reporting the number of deaths or the number of injured people.
Two annotators were initially allocated to manually label 500 news articles for their news category and for whether each article is a duplicate or non-duplicate, i.e. whether the article is a follow-up or not. The annotators were also asked to link each duplicate news article to its source article. Their inter-annotator agreement was measured using Cohen's Kappa coefficient, calculated as in Equation 1:

$$\kappa = \frac{P_o - P_e}{1 - P_e} \quad (1)$$

where $P_o$ represents the relative observed agreement on classified articles among the annotators and $P_e$ is the hypothetical probability of chance agreement. The value of the Kappa coefficient was 90% for the two annotators, which shows a very high degree of agreement. The remaining 10% of disagreements were resolved with the help of a third annotator. Through this process, the rules used to identify duplicate and non-duplicate news articles were derived and applied to annotate the rest of the news. Some news articles are totally different in both lexical and semantic terms; some convey a different sense of the story but share most of their text; and some are similar to each other in either semantic or lexical terms. Articles of the latter types are follow-ups of a specific news story published on different dates on the same or different source URLs.
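Where the annotation decisions are available as label lists, the coefficient can be computed directly; the sketch below uses scikit-learn's cohen_kappa_score, and the two label lists are illustrative rather than the actual annotations.

```python
from sklearn.metrics import cohen_kappa_score

# Illustrative duplicate/non-duplicate labels from two annotators
annotator_1 = ["dup", "non-dup", "dup", "dup", "non-dup"]
annotator_2 = ["dup", "non-dup", "dup", "non-dup", "non-dup"]

kappa = cohen_kappa_score(annotator_1, annotator_2)
print(f"Cohen's Kappa: {kappa:.2f}")
```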
It is seen in the dataset that some events occur on a small scale, and such news stories are published on a news website only once, while other events occur on a very large scale, so news channels publish related articles repeatedly on the same day. Moreover, if an event involves human loss on a very large scale, it may be revisited by different news channels on a yearly basis as well. All news articles are saved on secondary storage as documents in .txt format. The distribution of news articles and their statistics is given in Table 2, and a visualization of category-wise news statistics is shown in Figure 3.

IV. METHODOLOGY

A. NEWS CLASSIFICATION
One of the important objectives of this research is to build a classifier which can classify the news dataset and assign appropriate class tags to news stories. To this end, a conventional process has been adopted, as shown in Figure 4. The whole news corpus is divided into training and testing sets to check the accuracy of the different models. A large number of text classifiers have been studied and used for many different purposes, as can be seen in the related studies; however, it is not simple to pick any one of them as the most suitable classifier for a human loss news corpus. Among the most prominent and effective classification techniques for text data are Random Forests (RF) [23], [24], Support Vector Machines (SVM), and Multinomial Naïve Bayes (MNB). Random Forests have also been used for different fraud detection tasks [25] on textual data. Therefore, this research tests all of these well-known classifiers to choose the best one for the classification of human loss related news. It is pertinent to mention that the dataset involved in this research is not large enough to get promising results from deep learning approaches; however, we also employ a simple neural network (SNN) based classifier to categorize the news articles. The experiments have therefore been conducted using four different text classifiers: MNB, SVM, RF, and SNN. A sketch of how such a comparison can be set up is given below.
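The following sketch sets up the four-classifier comparison with scikit-learn, assuming TF-IDF features and default hyperparameters (the paper specifies neither); texts and labels stand for the pre-processed articles and their category tags.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

CLASSIFIERS = {
    "MNB": MultinomialNB(),
    "SVM": LinearSVC(),
    "RF": RandomForestClassifier(),
    "SNN": MLPClassifier(hidden_layer_sizes=(100,)),  # simple neural network
}

def compare_classifiers(texts, labels):
    # texts: list of news article strings
    # labels: accident / crime / disaster / terrorism tags
    for name, clf in CLASSIFIERS.items():
        pipeline = make_pipeline(TfidfVectorizer(), clf)
        scores = cross_val_score(pipeline, texts, labels, cv=10)  # 10-fold CV
        print(f"{name}: mean accuracy = {scores.mean():.3f}")
```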

B. NEWS TEXT SIMILARITY
Text similarity is a method which helps to identify the relationship between text snippets. Text similarity is categorized into lexical and semantic similarity. In lexical similarity, text is matched on the basis of its characters; for example, "love" and "dove" are lexically similar because most of their characters are the same, i.e. "ove". On the other hand, semantic similarity is computed on the basis of the meanings or synonyms of words. According to semantic similarity, "love" and "dove" are not similar, but "good" and "fair" are, because their meanings are the same. News stories are well edited and are generally written using typical words. Every newspaper has different news reporters and editors, so there is a good chance that the same news is presented using a different set of words depending upon the vocabulary and writing style of the reporter and editor. Two widely used similarity measures, the Jaccard coefficient and cosine similarity, are applied to the news articles to calculate the similarity score between any two articles. To reduce the time of similarity computation, stop words are removed and some further pre-processing is applied, as discussed next.

C. PRE-PROCESSING
Pre-processing is the first and most basic step in news follow-up identification. Crawled data is a combination of garbage and useful information; to increase the speed of the system, the data must be pre-processed to convert it from unstructured to structured form. Because all the data is crawled from various news channel websites, there is a high possibility of noisy data in the form of HTML tags and stop words that are useless to the system. The text must therefore first be freed from punctuation marks, quotes, irrelevant text, stop words, semicolons, exclamation marks, etc. Figure 4 shows a block diagram of all steps involved in news classification and duplicate news detection, Figure 5 shows the steps involved in pre-processing a news article, and Figure 6 shows the algorithm used for pre-processing.

1) NEWS CLEANSING
News crawled from various news channel websites contains HTML tags in its text. These tags add no positive value to the computation of news similarity, which is why a news cleansing process is needed at the first stage of pre-processing.

2) NEWS TEXT TOKENIZATION
Breaking down text into words or tokens on the basis of spaces is known as text tokenization. Each word after tokenization is considered to be a string. These tokens can be used to build a dictionary for word comparison in document matching [26], [27].

3) REMOVAL OF DIACRITIC WORDS
Diacritics are language dependent. As the news articles are written in English, the diacritic symbols removed here include commas, quotes, full stops, double commas, brackets, dashes, special characters, underscores, etc. A simple way to remove them is to replace each with a space [26]. A minimal sketch of these first pre-processing steps follows.
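A minimal sketch of cleansing, diacritic/punctuation removal, and tokenization, with illustrative regular expressions rather than the exact rules used in the paper:

```python
import re

def clean_and_tokenize(raw_html):
    # News cleansing: strip HTML tags left over from crawling
    text = re.sub(r"<[^>]+>", " ", raw_html)
    # Diacritic/punctuation removal: replace commas, quotes, dashes,
    # brackets, underscores, and other special characters with spaces
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)
    # Tokenization: split on whitespace into word tokens
    return text.lower().split()
```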

a: STOP WORDS REMOVAL
Stop word removal is used in the experimental setup, where different variants of stop word removal (SWR) and stemming (S) are tested for their effect on the accuracy of both proposed similarity computation techniques.
It has been observed and calculated by different SEO experts that a typical text document can consist of up to 70 percent stop words, with only 30 percent of the words carrying the actual information. Stop words are the most common words in natural language; they appear in almost every sentence, affect the efficiency of any natural language processing algorithm, and carry only syntactic information about sentence formation [28]. For example, "is, are, am, was, were" are stop words that have high term frequency in documents but carry less meaning than keywords (term frequency is the number of times a term or word occurs in a document). Some researchers have shown the effect of stop words in Arabic information retrieval through a comparative study [29]. As the first pre-processing step, all stop words are eliminated from the news articles to speed up the core task. The list of stop words used is provided by the Journal of Machine Learning Research and has been used by many researchers [30].

b: STEMMING
Stemmed words are words that share the same root; for example, love is the root of loved, loving, loveable, lovely, etc. Instead of storing all of these words, only the root word love should be stored in the repository, to minimize storage and the computation time of follow-up news identification. The Porter stemming algorithm is used to remove suffixes from the tokens of a news text [31]. A sketch combining stop word removal and stemming is given below.
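A sketch of the SWR and S steps using NLTK's English stop word list and Porter stemmer; the paper uses the JMLR stop word list [30], so NLTK's list here is a stand-in.

```python
from nltk.corpus import stopwords   # requires nltk.download("stopwords")
from nltk.stem import PorterStemmer

STOP_WORDS = set(stopwords.words("english"))
stemmer = PorterStemmer()

def remove_stop_words_and_stem(tokens, swr=True, stem=True):
    # SWR variant: drop high-frequency function words ("is", "are", ...)
    if swr:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    # S variant: reduce each token to its Porter stem,
    # e.g. "loved", "loving", "lovely" all map to the same stem
    if stem:
        tokens = [stemmer.stem(t) for t in tokens]
    return tokens
```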

D. NEWS SIMILARITY APPROACHES

1) JACCARD INDEX
The Jaccard index, also known as the Jaccard similarity coefficient, is used here for full-text similarity of news articles. This technique measures the diversity and similarity of sample text data: similarity is computed as the number of matched words divided by the total number of distinct words in the two texts. Suppose A and B are the word sets of two news texts formed after pre-processing. The Jaccard similarity coefficient is computed as in Equation 2:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|} \quad (2)$$
The Jaccard coefficient varies between 0 and 1. The dataset on which this technique is applied contains news downloaded from different sources, each with different editors, as explained in the dataset section, so identifying follow-ups of news on the basis of full-text similarity is challenging, and this approach can fail for follow-up identification. Suppose the same type of incident occurs in the same area a few days apart: the text similarity of the two stories might exceed 90 percent, and a date feature cannot simply be added, because different incidents of the same type may occur on different dates. It is also possible that the same news was written by different editors in different ways, resulting in a lower Jaccard coefficient. A further flaw of this approach is efficiency, which degrades as the dataset grows over time, since two sets of shingles are compared for every pair.
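A minimal set-based implementation of Equation 2 over two pre-processed token lists:

```python
def jaccard_similarity(tokens_a, tokens_b):
    # |A intersection B| / |A union B| over the word sets of two texts;
    # ranges from 0 (no shared words) to 1 (identical word sets)
    a, b = set(tokens_a), set(tokens_b)
    if not a and not b:
        return 1.0  # treat two empty texts as identical
    return len(a & b) / len(a | b)
```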

2) COSINE SIMILARITY
Cosine similarity is an algorithm used to determine the similarity between documents without using traditional methods such as word-count distance (Euclidean distance). In our case each document is a news article, so cosine similarity can be used to compute the similarity between news articles. The term frequency (TF) of each word is calculated, the resulting word vectors are projected into a multidimensional space, and the cosine of the angle between them is calculated using Equation 3:

$$\cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{\sum_{i} A_i B_i}{\sqrt{\sum_{i} A_i^2} \sqrt{\sum_{i} B_i^2}} \quad (3)$$
The value of cos(θ) ranges from −1 to 1, where −1 indicates perfectly dissimilar results and 1 indicates perfectly similar results; with non-negative TF vectors the value in practice lies between 0 and 1.
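A minimal implementation of Equation 3 using raw term frequencies, as described above:

```python
import math
from collections import Counter

def cosine_similarity(tokens_a, tokens_b):
    # Build term-frequency (TF) vectors over the combined vocabulary
    tf_a, tf_b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(tf_a[w] * tf_b[w] for w in set(tf_a) | set(tf_b))
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no defined direction
    return dot / (norm_a * norm_b)
```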
The algorithm used to detect duplicate news articles is given in Figure 7; a sketch of the pairwise detection loop follows.
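Since Figure 7 is not reproduced here, the following is a hedged sketch of the pairwise detection loop; the similarity threshold is an assumption, as the paper does not state its cut-off value.

```python
def detect_duplicates(articles, similarity_fn, threshold=0.5):
    # articles: list of pre-processed token lists, one per news story;
    # similarity_fn: jaccard_similarity or cosine_similarity from above;
    # threshold: a placeholder value, not the paper's published cut-off
    duplicates = []
    for i in range(len(articles)):
        for j in range(i + 1, len(articles)):
            if similarity_fn(articles[i], articles[j]) >= threshold:
                duplicates.append((i, j))  # j flagged as duplicate of i
    return duplicates
```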

V. EXPERIMENTS AND RESULTS
A. EXPERIMENT SETUP AND EVALUATION METRICS FOR DUPLICATE DETECTION

1) EXPERIMENT SETUP

K-fold cross validation with 10 folds has been applied to perform the experiments: the whole dataset is divided so that 90% is used for training and 10% for testing in each fold. The division is stratified, that is, the percentage of instances of each category in the training and test sets remains the same. Training and testing are repeated 10 times, and the average results of these 10 iterations are reported. At the end, a discussion of the computed results provides a synthesis of all the experiments. A minimal sketch of the stratified split is shown below.
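A minimal sketch of the stratified 10-fold split using scikit-learn; the random seed is an assumption.

```python
from sklearn.model_selection import StratifiedKFold

def stratified_folds(texts, labels, n_splits=10, seed=42):
    # Each fold preserves per-category class proportions, so every
    # iteration trains on 90% of the data and tests on the held-out 10%
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fold, (train_idx, test_idx) in enumerate(skf.split(texts, labels), 1):
        yield fold, train_idx, test_idx
```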
Table 2 gives the statistics of the actual dataset. Precision and recall are used here to compute the F-measure of each class (discussed below), on the basis of which the two approaches are compared. Four different classes {Accident, Crime, Disaster, Terrorism} are tested with and without pre-processing.

2) EXPERIMENTAL SETUP
Two variants of pre-processing, stop word removal and stemming, are applied to the dataset to check the precision and accuracy of the system. In Table 4 and Table 5, 'P' represents the precision value and 'R' the recall value, while 'SWR' means that stop words were removed from the news article and 'S' means that the text was stemmed using the Porter stemmer before applying the Jaccard coefficient or cosine similarity. In these tables, '+' indicates that the variant was applied and '−' that it was not, so the combination of stop word removal and stemming yields four different test variations: {− −, + −, − +, + +}. A sketch of iterating over these variants is shown below.
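A sketch of iterating over the four variants, reusing the remove_stop_words_and_stem helper sketched in Section IV; the evaluation against gold duplicate labels is omitted for brevity.

```python
# The four pre-processing variants {- -, + -, - +, + +}
VARIANTS = [(False, False), (True, False), (False, True), (True, True)]

def score_variants(pairs, similarity_fn):
    # pairs: list of (tokens_a, tokens_b) tuples of tokenized news texts
    for swr, stem in VARIANTS:
        scores = [
            similarity_fn(
                remove_stop_words_and_stem(a, swr=swr, stem=stem),
                remove_stop_words_and_stem(b, swr=swr, stem=stem),
            )
            for a, b in pairs
        ]
        tag = f"SWR{'+' if swr else '-'} S{'+' if stem else '-'}"
        print(f"{tag}: mean similarity = {sum(scores) / len(scores):.3f}")
```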

B. RESULTS

1) NEWS CLASSIFICATION RESULTS
Category-wise and overall results for all selected classifiers are presented in Table 3, which shows that MNB and SVM perform better than RF. MNB is the best classifier for the Accident and Disaster categories, SVM is the best performer for news stories in the Crime category, and RF is most suitable for identifying Terrorism news stories. The overall results show that MNB performs well, making it the best classifier overall with 89.5% accuracy across all types of news stories. The same results are shown in graphical form in Figure 8.
It can be observed from Figure 8 that the accuracy of the classifiers MNB, SVM, and RF is almost the same, while that of SNN is much lower because it is based on a neural network: neural networks suffer from the curse of small datasets and perform well only when trained on large datasets. This research, however, not only involves news classification but also incorporates duplicate news detection, and a dataset of nearly 5,000 news articles has been compiled to serve both purposes. Although some larger datasets for news classification exist, involving news related to sports, politics, etc., none of them annotates duplicate news, nor do they incorporate human loss news. Therefore, the proposed approaches have been tested and evaluated on the human loss news dataset compiled in this research.
A deeper analysis reveals that MNB and SVM are statistically indistinguishable at the 1% significance level, and RF at the 3% level. Thus it can be inferred that the small differences could be caused by chance, and all three are equally good as far as accuracy is concerned. However, MNB is computationally cheaper, so it stands out as the best option. Table 4 presents the precision and recall values computed using the Jaccard coefficient, whereas Table 5 presents those computed using cosine similarity. Table 6 and Table 7 present the F-measure scores using the Jaccard coefficient and cosine similarity, respectively. Since the data instances are imbalanced, the F-measure provides a more reliable assessment, as it combines the precision and recall values shown in Table 4 and Table 5.

C. DISCUSSION
Analyzing the accuracies, it can be claimed that MNB is accurate, near to perfect, and more time-efficient than the other selected variants. It can therefore be concluded that MNB is overall the best of the considered variants for the classification of human loss news. Regarding the analysis and testing of the two algorithms, Jaccard coefficient and cosine similarity, with four different variants of pre-processing for duplicate news detection, the maximum F-measure values for the four categories can be seen in Table 6 and Table 7: the overall average F-measure computed with the Jaccard coefficient is 86.13% and with cosine similarity is 72.34%. The Jaccard coefficient thus worked much better than cosine similarity for identifying follow-ups of a particular news story. Considering the Jaccard coefficient under the different pre-processing parameters, it performs well when SWR is applied before computing similarity; the second and last rows of Table 6 show F-measure values above 87%. For disaster-related news articles the highest F-measure in Table 7 is approximately 90.4%; for this type of article the pre-processing variants have no significant effect on the similarity measure. Natural disasters have widespread effects, and such articles usually report higher death and injury tolls, so these stories are repeated even on a yearly basis, with much of the text and the same statistics republished. This is why the similarity measurement gives good results on them without applying the variants, whereas for the other types of news articles good results depend more on the variants.

VI. CONCLUSION
This research presents an authentic news dataset comprising four different classes of news, namely accident, crime, terrorism, and natural disaster. The dataset contains news stories and their duplicate or follow-up versions from the same or different news portals. It also provides a comparison of well-known text classifiers and of two widely used text similarity approaches for automatically identifying follow-ups of a specific news article under different pre-processing variations. The results reveal that MNB is the most accurate for classification, while the Jaccard coefficient performs much better than cosine similarity for duplicate detection of human loss related news articles, and for most classes stop word removal and stemming together improve the results. A deeper review of the poor results and wrongly reported duplicates shows that the text of a news article alone is not enough to detect follow-ups, as the error rate in this research is 13.87%. In future, this work can be improved by involving more domain-specific semantics of a news story by extracting relevant information through pre-processing. Lastly, the dataset needs to be extended to make it suitable for state-of-the-art deep learning approaches.