An Unsupervised Approach for Content-based Clustering of Emails into Spam and Ham through Multiangular Feature Formulation

The rapid growth of spam email attacks and the inherent malicious dynamism within those attacks on a range of social, personal and business activities warrants an intelligent and automated anti-spam framework. Attempts like malware propagation, identity theft, sensitive data pilfering, and monetary as well as reputational damage are sharply increasing, endangering the privacy of the victim. Current solutions are rather incomplete when the multidimensional feature range of email is taken into account. We believe a methodology based on Artificial Intelligence, especially unsupervised machine learning, is the way forward. This research investigates the application of unsupervised learning for the clustering of Spam and Ham emails. The overall goal of the research is to develop a framework that solely depends on unsupervised methodologies through a clustering approach that includes multiple algorithms, primarily using the email content (body) and the subject header. The clustering has been done on a novel binary dataset of 22,000 entries of ham and spam emails, composed of ten features (reduced from eleven after feature reduction). Seven of these ten features are unique to this study, engineered to represent impactful analytical email characteristics from a multiangular point of view. Of the five clustering algorithms investigated in this work, OPTICS produced the optimum clustering, demonstrating a 0.26% higher average efficacy than its nearest performer, DBSCAN. The average balanced accuracy for OPTICS and DBSCAN was found to be ≈75.76%.


I. INTRODUCTION
Email is an important medium of digital communication throughout the world. Billions of individuals rely on email communication for their personal, social and business needs. Unfortunately, the ubiquity of email has made it a perfect target for scammers, turning this seemingly simple but effective communication tool into a manipulative carrier of potentially damaging outcomes. Email spamming is generally defined as the act of dispersing messages that are unsolicited, oftentimes in large volumes, using the medium of email. On the other hand, emails that are communicated for genuine, lawful and authorised purposes are defined as Ham [1]. Spamming is deployed both for marketing purposes and to inflict reputational and financial damage, on both the personal and the institutional front. Financial gain is regarded as the driving motivation behind spamming, generating a yearly gain for the spammers of around USD 3.5 million [2]. By the end of 2020, over 4.1 billion email accounts were registered globally [3]. Approximately 306 billion emails were exchanged in 2020 alone, of which a lion's share of 55% were identified as spam [3]. These large volumes of unwanted email, apart from causing damage, waste users' time and patience as well as communication bandwidth, server memory and CPU cycles. Business Email Compromise (BEC) attacks resulted in financial damage of around USD 3.5 billion in 2019, as reported by the FBI [4]. Australian consumers and businesses, by the end of 2019, had lost over AUD 28 million to email fraud [7]. An average person spends around 28% of a regular workweek interacting with emails [5]. Only 38% of these emails, equivalent to ≈11% of an average workweek, are actually relevant.
Traditionally, the United States has been the most prolific source of spam emails (both phishing and marketing spam). However, the trend is on the rise in other parts of the world, and in the first half of 2020, Russia was found to be the largest source, accounting for over 20% of spamming worldwide [6]. The United States and Germany closely follow in second and third position with 9.64% and 9.41% of global spamming respectively [6].
Though a number of propositions are available for spam filtration systems based on supervised or semi-supervised algorithms, work based upon unsupervised methods is virtually non-existent. A few examples of the use of K-means clustering are available, but the processing overhead and general lack of practicality of these systems are bottlenecks for widespread implementation. Unsupervised algorithms have fundamentally distinct advantages over supervised methods, which we believe are key for true AI based systems. These differences will be discussed more broadly in the next section.
The contributions of this study are: I) developing a novel and comprehensive database of email content and subject based features from multiple publicly available email sources, which may be used for other relevant research; II) introducing a novel feature-set, formulated based upon the characteristics of the content and subject of an email; and III) critically investigating the clustering outcomes of a number of unsupervised algorithms on this dataset comprising mostly novel features representing both ham and spam emails.

II. BACKGROUND OF THE RESEARCH
An extensive number of related research attempts, both supervised and semi-supervised, have already been completed. In spite of this, research initiatives that fully rely on unsupervised methodologies to separate spam emails from legitimate ones remain hard to find. This research is a comprehensive initiative at addressing this gap. In this paper, we target the subject and content of the email, while in our previous work we addressed the header and domain information. The problem domain of spamming is not confined to one particular aspect of today's email subsystem; all sub-parts need investigation. The edge that unsupervised learning has over supervised learning, as well as the lack of research available on the topic of this study, have motivated this research. This is not to say that unsupervised methods are better than supervised or vice versa. The working procedure of unsupervised algorithms, however, even though further development is needed, is something we believe has more potential in developing highly autonomous systems leading towards a true AI based framework. Supervised algorithms need labelled data to work with, where the possible output for the corresponding input is already stated and the algorithm learns from the mapping; sourcing and managing such labelled data is often a difficult and complicated task [8]. Unsupervised clustering, on the other hand, has the advantage that it operates on unlabelled data and requires no training. Based on the dataset, the algorithms attempt to find the set of common features within a batch of assorted items and rearrange the data points into clusters based on that commonality [9]. In supervised learning, inputs demonstrating little variation in the training dataset can produce outcomes with high error rates in the inference phase, because the model could not be trained to appropriately recognise unexpected and rare patterns.
Unsupervised algorithms are often better at finding patterns or relationships among features that are too complex to detect through ordinary data analytics or observation. This study dissects the subject header and content of an email through such unsupervised clustering and investigates the clustering performance. We also focus on developing a versatile and relevant feature set, built from multiple angles, that impacts the clustering process and could generate a unique fingerprint for spam emails. This will help us gain an objective understanding and quantifiable knowledge about the effect of clustering. We use some common clustering algorithms with custom feature engineering based upon diversified characteristics of email content and subject. This study does not rely only on raw K-means clustering of words, the most common form of content clustering, which has already been used in a number of earlier studies.

III. STRUCTURE OF THE PAPER
Analysis of necessary background studies is discussed in Section 4. The section after that, Section 5, briefly describes the proposed method. Section 6 has a brief discussion of the datasets and features used in this research. Section 7 details the techniques used for the overall proposed framework: feature construction, feature selection, dataset construction, clustering and its evaluation. Section 8 describes the outcomes of the research and a few limitations that we have observed. This is followed by the conclusion and some directions for future work.

IV. RELEVANT STUDIES
Though it is quite difficult to find closely related work, as stated earlier, this section analyses somewhat related Machine Learning based research initiatives. Most of these are mainly unsupervised in nature or have at least deployed unsupervised learning techniques to address key aspects of the proposed system. We mainly focus on systems that have critically analysed the email content for their automated approach.
Basavaraju and Prabhakar [10] introduced text clustering using the 'Vector Space Model (VSM)', an algebraic model for the representation of text documents as vectors of identifiers [11]. The method performed reasonably well on spam email identification. Data were represented using VSM (often known as the Term Vector Model) and dimensionality reduction was carried out through a custom-developed clustering framework based upon BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) and K-means algorithms. The system demonstrated an average accuracy of just over 74% on a marginal dataset of 400 records with four different combinations of BIRCH and KNN. The authors employed raw words from the documents to formulate the VSM. A general limitation is that when spammers decide to employ character variations, for instance representing the word 'insurance' as 'i*n$u*rènce', their proposed framework will not be able to detect these variations.
Laorden et al. discussed a system [12] that identifies spam emails through a content based anomaly detection framework. The proposed system works by drawing a comparison between features such as 'Word Frequency' and those of a ham (legitimate email) dataset. For a significant deviation from the normal scale, a spam alert is raised. The researchers used a variation of K-means clustering called 'Quality Threshold (QT)', which provides an edge in downsizing the tally of vectors within the dataset that is designated as the 'normality'. Processing overhead is reduced significantly with this technique. On the other hand, the system may not work as expected in the presence of language features such as hyponyms [13], synonyms and metonymy. The authors demonstrated a weighted accuracy of 92.27%. The experiment used the LingSpam dataset [61].
K-means, along with Expectation Maximization (EM), was also used by Halder et al. [14] to develop a framework that works on particular schemas such as stylistic characteristics or features of emails (total number of contractions and punctuation marks, total count of email IDs used within the body etc.). Different semantic features, such as statistical measures of different words used in a batch of emails, were also investigated, as was a combination of these two approaches. The eventual cluster analysis was carried out on a dataset of 2,600 spam emails. The authors demonstrated that the method could be used to detect the composition styles of spam campaigns. Furthermore, the extracted patterns can also be used to build prototypes for prospective future identification of spam emails. With the combined approach, K-means produced an 80% success rate. EM, on the other hand, projected a success rate of 84.6% when dealing with only semantic features, dropping to 57.4% when the combined approach is considered. The result of the experiment was reported in terms of the 'purity' of clusters, a measure of cluster quality. The authors' area of focus [14] is rather limited, as a range of critical features in spam email identification, such as URL composition, email subject headers, attachments, and detailed domain as well as header information, have not been included in the framework.
Unsupervised Self-Organizing Map (SOM) based systems have also been explored by researchers such as Cabrera-León et al. [15]. Their system works on 13 different categories of emails. The authors started with a 4-stage preprocessing of emails (both ham and spam). In the first stage, batch-extraction of all the emails' subject and content was carried out and alphanumeric characters replaced the whitespaces. The next stage removed all the stop words and derived raw term frequency scores in addition to some other critical metadata (spam/ham) used in the processing. The third stage developed a 13-dimensional integer array to store the themes and categorise the processed texts. The preprocessing phase ends by attaching 'weights' to the words of all of the 13 categories. SOM was then deployed to build the model with the 'Batch' learning method. Eventually, a threshold value was put in place to label the clusters. An accuracy of 94.4% was achieved by the framework. However, an issue with the system is that the performance for off-topic emails was reduced, indicating potential room for further enhancement.
Padhiyar and Rekh [16] presented a semi-supervised model that is based upon the K-nearest neighbor (KNN) and Naive Bayes (NB) algorithms. The authors demonstrated that this achieved better classification accuracy than standalone KNN or NB based methods. However, in-depth inspection reveals that in all likelihood the work will not reach the expected performance when the availability of initially labelled documents is limited. The study addresses this issue by introducing the Expectation Maximization (EM) algorithm to manage a dearth of labelled data; however, this has not actually been implemented within the proposed system. The solution also requires an effective feature selection and pre-processing segment.

V. PROPOSED APPROACH
As the name indicates, unsupervised learning based models work only with unlabelled data, so no training phase is involved; whereas supervised techniques require training over a large dataset, often involving costly data labelling [17]. Unsupervised algorithms most commonly attempt to discover a common pattern associated with the features being processed within the dataset [18]. The algorithm rearranges the data items into separate clusters. Unsupervised learning is less time consuming and more computationally efficient [19], [20] than supervised approaches. In addition to the usual 'distance based' clustering [21], where specific distance metrics such as 'Euclidean Distance' [22] are used to calculate the similarity between data items or objects, 'density based' clustering is also commonly used. As shown in Fig. 1 (a), our approach comprises building a raw dataset of pre-processed contents and other features from a number of publicly available email collections of both ham and spam emails. This dataset is then converted to binary form. A feature selection process is applied through a mechanism best described as "feature reduction through feature elimination", which is explained later. The resulting dataset holds only the most important features and is ready for clustering.
In the subsequent step, five clustering algorithms (K-modes, DBSCAN, OPTICS, K-means and Spectral) are analysed as illustrated in Fig. 1 (b). Some, but not all, unsupervised algorithms can be programmed to generate only two clusters; in this research only such algorithms have been investigated. Once the cluster formations have been analysed, we apply an array of external and internal validation metrics to the generated clusters to appropriately quantify the performance of the algorithms. These validation steps then indicate the top performing clustering algorithms. Section 7 provides further details of the above steps, but we will first describe the datasets and feature construction in Section 6.

VI. DATASETS AND FEATURES USED
Our research is based on a custom-developed dataset where the features are of a binary nature. The feature-set is a mix of some features commonly used in similar research and some novel features (formulated based on the diverse characteristics of email content and subject). To develop the binary dataset of 22,000 records (around 68% of which are spam emails, the remainder being ham), where each record contains eleven features, we employed a number of publicly available datasets, such as the 2017 and 2018 spam collection by Bruce Guenter [23], TREC [24], the Enron Dataset [25], Hillary Clinton's political emails released by the US State Department [26], the highly used spam word and phrase collection by Theo Freeman [27], and the Fraudulent E-mail Corpus by Rachael Tatman [28]. These datasets contain phishing spam as well as advertising spam emails and ham emails (in raw text format). The binary dataset has been made publicly available [62].

A. FEATURE CONSTRUCTION
This section highlights the features that have been used for the research. Features f2 and f3 have been used previously in other studies [14,33,34], whereas only the concepts of f0, f1, f4 and f7 have been described in other studies; they have not been applied from the angle we have chosen in this research. Features f5, f6, f8, f9 and f10 are a novel set of features that we have not found in other studies related to this topic.

VII. DETAILED WORKFLOW
This section describes each of the subsections as illustrated in Fig. 1 (a and b), including the validation procedures and the algorithms' performance.

A. RAW TEXT TO PRE-PROCESSED BINARY DATASET
This stage includes the preprocessing of content and subject headers for each of the emails, leading first to the creation of the non-binary dataset of eleven features, and then the eventual binary dataset.

A.1. Content and Subject Header Preprocessing
Instead of directly clustering the email contents and subjects, which is the most common way of creating clusters of probable ham and spam emails from a mixed collection, we have created novel features and included some features that have been conceptually described in other research initiatives. Preprocessing of content and subject header is critical to features f0 and f1. Textual data are highly unstructured, and preprocessing allows the simplification of semantically duplicate words, removes noise and words that do not have any real impact, and compacts the overall text for further processing. The complete gamut of preprocessing steps allows both f0 and f1 to produce more accurate metrics. For the other features preprocessing has not been done. The following preprocessing techniques have been applied:

A.1.a. Stripping HTML tags
The raw text files available in public domain contain a number of HTML and other similar tags within the content and sometimes in the subject header. These tags have been removed.

A.1.b. Removing lone characters
All single characters were removed as part of the preprocessing as these did not have a meaningful impact.

A.1.c. Expanding the contractions
Contractions are abridged versions of syllables or words. These contractions were expanded to their full form; for instance, you've becomes you have.

A.1.e. Tokenization and Removal of Stopwords
Tokenization is the mechanism of breaking a document down into its individual parts, called tokens, such as words, punctuation marks and numbers. The complete content and subject headers of individual emails were tokenized and certain stopwords were removed in this stage. Stopwords are mostly pronouns, prepositions and linking verbs (such as 'is', 'are', 'was' etc.). We have used the Python library spaCy [29] for the preprocessing, which has a rather long list of stopwords. We removed some words from the default spaCy stopword repository which we deem important for this particular research; 'Appendix C' contains those words.
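The tokenization and stopword-removal step can be sketched as follows. The study uses spaCy's full stopword list; the small set and the regex-based tokenizer below are illustrative assumptions only:

```python
import re

# A tiny illustrative stopword set; the study uses spaCy's much longer list
STOPWORDS = {'is', 'are', 'was', 'the', 'a', 'an', 'of', 'to', 'and'}

def tokenize_and_filter(text):
    # Break the text into lowercase word tokens, then drop stopwords
    tokens = re.findall(r"[A-Za-z0-9']+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]
```

For example, `tokenize_and_filter("The account is closed")` keeps only the content-bearing tokens `['account', 'closed']`.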

A.1.f. Lemmatization
Lemmatisation is the algorithmic process of grouping together the inflected forms of a word in order to analyse them as a single item, identified by the lemma of the word [30]. Generally, many words appear in several inflected forms. For instance, the verb 'to sleep' may appear as 'sleep', 'sleeps', 'slept' or 'sleeping'. The base form, 'sleep', is termed the lemma for all the other versions of usage. spaCy has also been used for lemmatisation. The lemmatisation process was run three times in succession to refine the preprocessed content and subject. Figure 2 shows an example of this process.

A.1.g. Removal of non-English words
At the final stage of preprocessing, words or lemmas that are not found in English vocabulary, such as incomplete or alphanumeric tokens, are removed.

A.2. Feature Construction
The features which have been used to initially populate the dataset (X in Fig. 1 (a)) are described below.

F0:
The total number of spam words present in the content has been used in some studies. However, we have also accounted for phrases (such as 'billion dollars') instead of words only. We calculate the percentage of the total count of such spam words and phrases to that of the total words present in the content. This collection of highly probable spam words and phrases has been taken from a publicly available source [23]. It is also supported by a number of other web sources. An excerpt has been added in 'Appendix A'.

F1:
Similar to f0, but applied to the subject header. Only words, not phrases, are used for this feature.

F5:
Percentage of words that are not found in the English language vocabulary. For instance, for "cloze his bank account", the feature would yield a value of 0.25, as the word "cloze" is not an English word. The NLTK Wordlist [31] has been used in this research as a comprehensive dictionary of English language words.
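Features f0 and f5 can be sketched as simple ratios. The spam lexicon and dictionary below are tiny illustrative stand-ins for the published spam-phrase collection and the NLTK wordlist:

```python
# Illustrative word lists; the study uses a published spam-phrase
# collection and the NLTK wordlist as its English dictionary
SPAM_TERMS = {'billion_dollars', 'free', 'winner'}
ENGLISH_WORDS = {'close', 'his', 'bank', 'account', 'free', 'winner'}

def spam_word_ratio(tokens):
    # f0: share of spam words/phrases (phrases pre-joined with '_')
    return sum(t in SPAM_TERMS for t in tokens) / len(tokens)

def non_english_ratio(tokens):
    # f5: share of tokens absent from the English dictionary
    return sum(t not in ENGLISH_WORDS for t in tokens) / len(tokens)
```

For the example in the text, `non_english_ratio(['cloze', 'his', 'bank', 'account'])` yields 0.25.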

F6:
This feature represents the total number of words which are not in the English dictionary but have at least one valid dictionary word that is a close match. If y is to be considered a Close Match of an invalid word x, the following set of conditions must be True:
a. The length of x has to be greater than two characters and x must not contain any of these special characters -"()[]{}+-_*/,\~|%``"- to begin the process
b. The first and last character of x and y need to be the same
c. A variation or difference of at most two characters is allowed, keeping the first and last character of both x and y the same
d. y must be of the same length as x
e. The three special characters '@', '!' and '$' are allowed within x but cannot be its first or last character
Table 2 provides some examples of how this feature works on invalid words, whereas Table 3 details how the examples in Table 2 satisfy conditions (a) to (e).
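Conditions (a) to (e) translate directly into a small predicate. This is a sketch of our reading of the rules, not the exact implementation used in the study:

```python
FORBIDDEN = set('()[]{}+-_*/,\\~|%`')   # condition (a)
ALLOWED_INSIDE = set('@!$')             # condition (e)

def is_close_match(x, y):
    # (a) x must be longer than two characters with no forbidden characters
    if len(x) <= 2 or any(c in FORBIDDEN for c in x):
        return False
    # (e) '@', '!' and '$' may appear inside x but not at either end
    if x[0] in ALLOWED_INSIDE or x[-1] in ALLOWED_INSIDE:
        return False
    # (d) y must be of the same length as x
    if len(x) != len(y):
        return False
    # (b) first and last characters must agree
    if x[0] != y[0] or x[-1] != y[-1]:
        return False
    # (c) at most two characters may differ
    return sum(a != b for a, b in zip(x, y)) <= 2
```

The Table 2 examples behave as expected, e.g. `is_close_match('eztreme', 'extreme')` and `is_close_match('at$akn', 'attain')` both hold.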

F7:
This feature provides the percentage of words related to currency, such as USD, AUD, U$ etc. The complete list of the most commonly used words that have been taken into consideration for this feature can be found in 'Appendix A'.

TABLE 2. EXAMPLES OF INVALID WORDS AND THEIR CLOSE MATCHES

#   | Invalid word (x)               | Close match(es) (y)
eg0 | 'a@@ain', 'attvin' or 'at$akn' | attain
eg1 | tein, tuun                     | turn or twin
eg2 | eztreme                        | extreme

F8:
There are several steps before the final value is reached, as outlined below:
• From the pre-processed content, spam words and phrases are first identified. Phrases are turned into a single word through the use of '_' between words (the preprocessing stages were discussed in an earlier section).
• Spam words and phrases are then numbered as shown in Fig. 3. Numbers are added in descending order, primarily for ease of algorithm development.
• These numbered words and phrases are sorted alphabetically as shown in Table 4.
• Forty-five word pairs are then derived with a view to calculating the distance between the high-probability spam words and phrases (Fig. 4). The combinations are derived using (1), where n is the aggregated total of words, ten in this case.
• The distance between the words in a pair is determined. For the purpose of simplicity, three word pairs out of the forty-five have been used to visually demonstrate how the distance is calculated, in Fig. 5.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2021.3116128, IEEE Access

FIGURE 5. CALCULATING THE DISTANCE BETWEEN WORDS WITHIN A WORD PAIR
• All the forty-five distances are placed side by side, starting from the first word pair and sequentially moving to the last, and a space character is inserted after every fourth value. For instance, if we had six distance values such as 22, 2, 0, 6, 10 and 35, we would construct the complete string, s, as shown in Fig. 6. The space character is added to simplify the hashing operation.
• In this step, the numeric string s is used as the input to the hash function. The hashing algorithm that we have used here is MinHash [32].
The hash code that is generated is particularly useful for identifying spam campaigns as in the case of such campaigns, a large number of emails are spread which may have a slightly different segment of texts in the form of receiver's name, address, office and account information etc. but the majority of the content remains the same.
MinHash has a distinct advantage in such cases as slight changes in the original content does not change the hash code significantly. It will normally generate closely similar hash codes for emails having reasonably similar content and a clustering effort can produce useful results.
Though we have used this feature in a different manner in this research, we do have plans to extend this idea to carry out clustering operations on spam campaigns.
The hexadecimal hash string (Fig. 7(a)) that is now generated by MinHash is transformed into a fully numeric string by removing the alphabetical characters (Fig. 7(b)); the length is also limited by stripping the first twenty-two digits, as these are mostly '0's, and rounding up after the eighteenth digit (Fig. 7(c)). Finally, the number is changed into a fraction by moving all the digits to the right of the decimal point. Figure 7 shows a hash generated by the algorithm and the subsequent steps up to the resulting final fractional value.
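The f8 pipeline (pairwise distances, the spaced numeric string, hashing and the final fractional value) can be sketched as below. The study hashes with MinHash; `sha1` stands in here purely so the sketch is self-contained, and the exact distance definition is our assumption:

```python
import hashlib
from itertools import combinations

def f8_value(tokens, spam_words):
    # Positions of spam words/phrases within the pre-processed content
    positions = [i for i, t in enumerate(tokens) if t in spam_words]
    # Pairwise distances for every pair of positions (45 pairs for ten words)
    dists = [abs(a - b) for a, b in combinations(positions, 2)]
    # Place the distances side by side, a space after every fourth value
    parts = []
    for i, d in enumerate(dists, start=1):
        parts.append(str(d))
        if i % 4 == 0:
            parts.append(' ')
    s = ''.join(parts)
    # The study uses MinHash here; sha1 is an illustrative stand-in
    digest = hashlib.sha1(s.encode()).hexdigest()
    # Keep the digits only, then move them behind the decimal point
    numeric = ''.join(c for c in digest if c.isdigit())
    return float('0.' + numeric) if numeric else 0.0
```

Because similar content yields similar distance strings, near-duplicate campaign emails tend to produce comparable values.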
F9:
This feature keeps a total count of those words within the content that have three consecutive identical characters, for instance, "profiiit". With such techniques, spammers often try to evade conventional spam filtering frameworks that rely on the correct identification of suspicious words.
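f9 is a straightforward count, sketched here with a regular expression that matches any character repeated three times in a row (function name is ours):

```python
import re

def count_triple_char_words(tokens):
    # Count words containing three consecutive identical characters
    return sum(1 for t in tokens if re.search(r'(.)\1\1', t))
```

`count_triple_char_words(['profiiit', 'profit'])` counts only "profiiit".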

F10:
This feature provides a measure of how the spam words and phrases are spread over the content of both ham and spam emails. The standard deviation, the square root of the average squared distance from the mean, has been applied to the distance points calculated as shown in the discussion of feature f8. The series of resulting values produces distinct patterns of variability for the two types of email, which can eventually be used as a feature of useful impact.
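f10 then reduces the f8 distance series to its population standard deviation, which the standard library provides directly (function name is ours):

```python
import statistics

def spam_word_spread(distances):
    # Population standard deviation of the pairwise spam-word distances;
    # evenly spread spam words give a low value, bursty placement a high one
    return statistics.pstdev(distances)
```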

A.3. Formulation of the Binary Dataset
Both the non-binary and the binary dataset of 22,000 entries contain eleven features. The dataset of both ham and spam emails is ordered in a random fashion. To convert the dataset from non-binary (X in Fig. 1 (a)) to binary (Y in Fig. 1 (a)), we have deployed the Standard Deviation (SD) of nine of the features as the decider, while the remaining feature (f3) is already in binary form. For f7, a default value of '0/1' has been used based on the extent of the presence of multiple currency related tokens. In our initial exploration of the non-binary dataset, it could be observed that the majority of features have higher numeric values for spam emails than for ham, whereas lower scale numbers are more evenly distributed between the two classes of email; we consider this variability impactful. The SD in this case provides a reasonable barrier of separation, which is required for clustering algorithms. If x is a data point within a feature f and σ is the SD of f, then its binarized equivalent fB holds either 0 or 1 as per Table 5. Binarization often reduces the randomness in a dataset, increasing the processing efficiency. An excerpt of the dataset is shown in Fig. 8.
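Assuming the Table 5 rule thresholds each value against its feature's SD (our reading; the exact rule is given in Table 5), the binarization step looks like:

```python
import statistics

def binarize_feature(values):
    # SD-threshold binarization: 1 where the value exceeds the feature's
    # standard deviation, 0 otherwise (an assumed reading of Table 5)
    sd = statistics.pstdev(values)
    return [1 if v > sd else 0 for v in values]
```

Applied column by column, this turns the non-binary dataset X into the binary dataset Y.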

B. FEATURE REDUCTION THROUGH FEATURE ELIMINATION
We used both the Laplacian score [35] and Multi-Cluster based Feature Selection (MCFS) [36], two unsupervised feature selection algorithms, on the binary dataset. There were a few more choices of feature selection algorithm, but these did not scale properly to fairly large datasets. Before shedding light on our proposed feature reduction method, a brief discussion of the two algorithms is in order:

B.1. Multi-Cluster-based Feature Selection (MCFS)
Generally, MCFS produces an optimised feature-set through the application of an 'L1-regularized least-squares' problem [36], as pointed out in (2). MCFS can preserve the multi-cluster structure of the input feature-set, while spectral analysis is carried out over the data points to determine the correlations among features. The unsupervised characteristics of the algorithm allow it to work out the correlation even in the absence of corresponding labels. To rank the features, they are assigned a score (termed the 'MCFS Score') based on the maximum coefficient of sparse representation. The most useful features are subsequently ranked in descending order. MCFS performs particularly well when the total number of features is less than fifty [37].

B.2. Laplacian Score for Feature Selection
The key working principle of this algorithm is the inference that data points placed within the same class are often closer to each other; therefore, it is possible to measure the importance of a feature through its degree of locality preservation. The Laplacian score initiates the procedure by embedding the data points on a nearest neighbour graph T containing m nodes. The i-th node stands for the element zi. The graph facilitates a connection from zi to another node or element zj, which belongs to the k nearest neighbours of zi. The Weight Matrix W of T, defined using (3), illustrates the local structure of the data space [7].
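The Laplacian score can be sketched in NumPy as below, using a heat-kernel kNN graph with illustrative k and t values; lower scores indicate better locality preservation:

```python
import numpy as np

def laplacian_scores(X, k=3, t=1.0):
    # Laplacian score per feature (column); lower = more important
    n = X.shape[0]
    # Pairwise squared Euclidean distances between data points
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    # k-nearest-neighbour graph with heat-kernel weights, symmetrised
    idx = np.argsort(sq, axis=1)[:, 1:k + 1]
    rows = np.repeat(np.arange(n), k)
    W = np.zeros((n, n))
    W[rows, idx.ravel()] = np.exp(-sq[rows, idx.ravel()] / t)
    W = np.maximum(W, W.T)
    d = W.sum(axis=1)                  # node degrees
    L = np.diag(d) - W                 # graph Laplacian
    scores = []
    for r in range(X.shape[1]):
        f = X[:, r].astype(float)
        f_t = f - (f @ d) / d.sum()    # remove the trivial eigenvector
        denom = f_t @ (d * f_t)
        scores.append((f_t @ L @ f_t) / denom if denom > 0 else np.inf)
    return np.array(scores)
```

Ranking the columns by ascending score yields the ordered feature list that the reduction step consumes.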

B.3. Proposed Method for Feature Selection
The primary aim of adopting the process of feature selection is to have confidence that the feature(s) that are eliminated indeed offer significantly less variation. Once the ranking over the full set of eleven features is done by both algorithms, the two least important features produced by each algorithm are separated, that is, the two features at the bottom of each algorithm's ranked feature list. Out of these two sets (s1 and s2) of four features in total, f9 was the common one. It has therefore been removed from the final set of features, resulting in a set of ten features for clustering purposes. Note that if no common feature is present within s1 and s2, the original set of features has to be used for clustering. Pseudocode 1 explains the feature reduction procedure, where Saf is the set of all features and R is the set of most important features. As can be seen from Pseudocode 1, the dataset Z of Fig. 1 (a) is composed of the features in set R. This dataset is now ready for clustering, as detailed in the next section.
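The selection logic of Pseudocode 1 amounts to intersecting the bottom-two features of the two rankings; a minimal sketch (function and variable names are ours):

```python
def reduce_features(all_features, laplacian_ranking, mcfs_ranking):
    # Rankings list features from most to least important; take the
    # bottom two from each (sets s1 and s2 in the text)
    s1, s2 = set(laplacian_ranking[-2:]), set(mcfs_ranking[-2:])
    common = s1 & s2
    # Drop any feature common to both bottom sets; otherwise keep all
    if not common:
        return list(all_features)
    return [f for f in all_features if f not in common]
```

In this study the common feature was f9, so the reduction returns the remaining ten features.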

C. APPLICATION OF UNSUPERVISED CLUSTERING ALGORITHMS
This section will discuss the algorithms that have been employed for clustering purposes ( Fig. 1 (b)). As mentioned previously, only those algorithms that allow us to parametrically control the number of clusters created, have been deployed. In the subsequent sub-sections, the resulting clusters will be investigated through 3D Scatterplot visualisation.

C.1. DBSCAN (Density-Based Spatial Clustering of Applications with Noise):
DBSCAN implements 'density based clustering', in contrast to the centroid-based clustering we will see in K-means. Distinctive clusters present within the set of data points can be recognised through density based clustering (an unsupervised learning method). The algorithm assumes that clusters in a data space usually form a contiguous zone of high point density, which can be disengaged from other similar dense regions [38] [7]. DBSCAN initiates the process by calculating an approximation of 'density' and then subsequently pushes those data points that are placed in low density areas further away from each other, as well as from zones of high density. The 'Mutual Reachability Distance (MRD)' (5) achieves this task [7].
core_d(u), the 'Core Distance', is calculated for parameter d for a point u as the minimum radius necessary to classify u as a core point. If the given point is not a core point, its Core Distance is undefined. Fig. 9 illustrates the concept of Core Distance for d = 4.

FIGURE 9. CORE DISTANCE
So, if the minimum number of points per cluster is large, the linking core distance will be larger as well. x(u, v) is the original metric distance between u and v. Dense points with reasonably low core distances are kept at the same distance from each other, while sparsely placed points are pushed apart (by a distance at least equal to their core distance) to be positioned away from any other point. DBSCAN then builds a Minimum Spanning Tree to pinpoint these dense regions before extracting the resulting clusters [7].
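The mutual reachability distance of (5) can be sketched in a few lines. This is a minimal illustration assuming a Euclidean metric; the helper names `core_distance` and `mutual_reachability` are chosen here for clarity and are not taken from the paper.

```python
import numpy as np

def core_distance(X, i, k):
    """Distance from point i to its k-th nearest neighbour (counting itself):
    the minimum radius needed for i to qualify as a core point."""
    dists = np.sort(np.linalg.norm(X - X[i], axis=1))
    return dists[k - 1]

def mutual_reachability(X, i, j, k):
    """MRD(i, j): max of both core distances and the metric distance x(u, v).
    Sparse points are thus 'pushed apart' to at least their core distance."""
    d = np.linalg.norm(X[i] - X[j])
    return max(core_distance(X, i, k), core_distance(X, j, k), d)
```

Note how a sparsely placed point inflates the MRD: its large core distance dominates the maximum, which is exactly the separation effect described above.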

C.2. K-means:
K-means is a key centroid-based clustering algorithm that groups together data points that are sufficiently similar, revealing underlying patterns. The final output of K-means is obtained through an algorithmic method called iterative refinement. K-means minimises the sum of the squared distances between each cluster's centroid and its data points, where the cluster centroid is the arithmetic mean of all the data points belonging to that cluster. K denotes the number of groups (the initial number of clusters that must be supplied). Based on the identified similarities among the features, each data point is iteratively allocated to one of these clusters [39]. In the majority of cases, K-means applies the Euclidean Distance to work out the distance between two data points (Vn and Vm), as shown in (6).
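As a concrete sketch of the iterative refinement described above, using the scikit-learn library the paper later names; the toy data and parameter values here are illustrative, not the paper's:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated 2D blobs standing in for email feature vectors.
X = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
              [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]])

# n_clusters=2 mirrors the ham/spam split targeted in this work.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.labels_
```

Each `labels_` entry is the index of the centroid (Euclidean-) nearest to that point after refinement converges.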

C.3. Spectral:
Spectral clustering, a graph-based clustering technique, offers improved performance in some cases compared to K-means. The clustering process first generates a 'Similarity graph' (a non-negative symmetric graph) of the M objects. The most common way of constructing this similarity graph is through ε-neighbourhood (Epsilon neighbourhood) graphs. Generally, Spectral clustering internally employs K-means to group the objects into k clusters, but prior to that a feature vector is created for each of the M objects from the first k eigenvectors of the graph's Laplacian matrix [7]. Graph Laplacian matrices, L [41], are the core of the Spectral algorithm, as demonstrated in (7).
where Y is an 'Adjacency matrix' with Yij ≥ 0 of graph R, and D is the 'Diagonal matrix' of Y. A normalised class of Laplacian matrix, Lxy, is often defined as in (8), where the entries of the Diagonal matrix are denoted by d.
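The Laplacian of (7), L = D − Y, and its symmetric normalised form can be computed directly. The 4-node adjacency matrix below is a hypothetical example, not data from the paper:

```python
import numpy as np

# Hypothetical 4-node adjacency matrix (symmetric, non-negative weights).
Y = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

D = np.diag(Y.sum(axis=1))      # degree (diagonal) matrix
L = D - Y                       # unnormalised graph Laplacian, eq. (7)

# Symmetric normalised Laplacian: I - D^{-1/2} Y D^{-1/2}
D_inv_sqrt = np.diag(1.0 / np.sqrt(Y.sum(axis=1)))
L_sym = np.eye(4) - D_inv_sqrt @ Y @ D_inv_sqrt
```

The rows of L sum to zero by construction, and the eigenvectors of L (or L_sym) corresponding to the k smallest eigenvalues supply the per-object feature vectors that K-means then clusters.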

C.4. K-modes:
K-modes extends K-means by applying Modes in place of Means for clustering; in addition, a rather straightforward matching dissimilarity measure for categorical objects is used. K-modes also attempts to minimise the clustering cost function through a frequency-based procedure with a view to updating the modes [42].
A dissimilarity measure for A1 and A2, two n-dimensional vectors, can be generated using (9). The greater the total number of mismatches, the less similar A1 and A2 are. This dissimilarity, d, is given in (9).
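The matching dissimilarity of (9) is simply a count of attribute positions at which the two categorical vectors differ; a minimal sketch (helper name ours):

```python
def matching_dissimilarity(a1, a2):
    """Simple matching dissimilarity, eq. (9): the number of positions
    at which the categorical attributes of two n-dimensional vectors
    differ. More mismatches means less similarity."""
    assert len(a1) == len(a2), "vectors must have the same dimension n"
    return sum(x != y for x, y in zip(a1, a2))
```

For the binary dataset used in this work the measure coincides with the Hamming distance between the two feature vectors.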

C.5. OPTICS (Ordering Points To Identify Cluster Structure):
OPTICS is another useful density-based clustering algorithm, similar to DBSCAN in its functionality, but able to efficiently handle clusters of varying densities. The algorithm first calculates an 'Ordering' of all objects in the input dataset. The process continues by identifying core samples of high density and then expanding these into the necessary number of clusters. For each point or object within the dataset, an appropriate 'Reachability distance', r_d, and core distance, c_d, are saved. The reachability distance of a point q with respect to another point p is the shortest distance from p, provided p is a core object. Core objects are those that have dense neighbourhoods; generally, p is considered a core point or object if at least min_pts points, including itself, can be detected within its ε-neighbourhood. The reachability distance also cannot be smaller than the core distance, c_d, as pointed out in (10) [43]. Epsilon, ε, denotes the maximum distance considered, whereas min_pts indicates the minimum number of points required for the formation of a cluster [7]. To generate the resulting ordering, OPTICS maintains a linear list, often called OrderSeeds, that renders the density-based clustering structure of the data. In this list, objects are sorted by their reachability distance with respect to their closest core points or objects.
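A short sketch of OPTICS via scikit-learn, here using DBSCAN-style cluster extraction from the computed ordering; the toy data and parameter values (`min_samples` playing the role of min_pts, `eps` the role of ε) are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import OPTICS

# Two dense groups plus one isolated outlier standing in for noisy data.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1], [5.1, 5.1],
              [20.0, 20.0]])

# OPTICS computes the reachability ordering once; clusters are then
# extracted from it, here with a DBSCAN-equivalent cut at eps=0.5.
opt = OPTICS(min_samples=3, cluster_method="dbscan", eps=0.5).fit(X)
labels = opt.labels_          # -1 marks points left as noise
```

`opt.reachability_[opt.ordering_]` exposes the reachability plot itself, from which clusters of differing densities can be read off, which is the practical advantage over a single-eps DBSCAN run.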

D. CLUSTERS PRODUCED BY THE ALGORITHMS
Before applying the clustering algorithms to our dataset, the Hopkins Statistic [44] was computed to evaluate the probable existence of relevant, cluster-like non-random regions within the data; it indicates whether the dataset truly contains meaningful clusters. The statistic reported a probability of 94.44%, indicating a high likelihood that relevant clusters are present.
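The Hopkins Statistic is not part of scikit-learn; the sketch below is one common formulation and should be treated as an assumption rather than necessarily the implementation behind [44]:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def hopkins(X, m=None, seed=0):
    """Hopkins statistic: compares nearest-neighbour distances of m real
    sample points against m uniform points drawn over the data's bounding
    box. Values near 1 suggest strong clustering tendency, ~0.5 randomness."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    m = m or max(1, n // 10)
    nn = NearestNeighbors(n_neighbors=2).fit(X)

    # Uniform random probes within the bounding box of X
    U = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, d))
    u_dist = nn.kneighbors(U, n_neighbors=1)[0].ravel()

    # Real sample points (2nd neighbour skips the point itself)
    idx = rng.choice(n, m, replace=False)
    w_dist = nn.kneighbors(X[idx], n_neighbors=2)[0][:, 1]

    return u_dist.sum() / (u_dist.sum() + w_dist.sum())
```

On strongly clustered data the uniform probes land far from any real point while real points have close neighbours, so the ratio approaches 1, consistent with the 94.44% reported above.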
In an ideal scenario, where an algorithm produces 100% correct results, it would generate clusters as shown in Fig. 10, in which all the data points within the dataset constitute perfect clusters (no overlapping is observed). That is, no data points representing ham have been misclassified as spam emails; in the context of the 3D shape of the figure, data points are not heading upward from the bottom plane to the z1 zone (marked with a red dotted box). Likewise, no spam emails have been misclassified as ham; data points are not heading downward to the bottom plane in the z0 zone. In reality, however, the outcome will not be this perfect, as there will always be some degree of misclassification and probable mixing of data points between the clusters.
A visual inspection of the algorithms' clustering patterns on our dataset is included in the next section. Algorithms that come closer to the ideal clustering illustrated above indicate better performance.

D.1. Clusters produced by K-modes:
Figure 11 visualises the clustering produced by K-modes. It is evident that a high number of data points are placed in the regions of misclassification for both ham and spam emails; in fact, the misclassification rate is higher for ham than for spam. The clustering thus shows mixed-to-poor output, which is not acceptable.

D.2. Clusters produced by K-means:
The clustering structures produced by K-means can be seen in Fig. 12. The resulting clusters, as can be observed, are extremely similar to those produced by K-modes. The performance of K-means is therefore not at a reasonably satisfactory level either.

D.3. Clusters Produced By Spectral:
Spectral clustering, as depicted in Fig. 13, also failed to achieve a noticeable improvement over K-modes and K-means. In fact, the misclassification rate is higher for both ham and spam emails than for the previous two algorithms.

D.4. Clusters Produced by OPTICS:
The clustering produced by OPTICS, as depicted in Fig. 14, is clearly more accurate and robust than that of the previous three algorithms. For both ham and spam emails, the misclassification rates are considerably lower, indicating improved cluster quality; the misclassification rate for spam emails in particular has fallen sharply.

D.5. Clusters Produced by DBSCAN:
The clusters produced by DBSCAN, as portrayed in Fig. 15, are almost identical to those of OPTICS, though the misclassification rate is slightly higher. Nevertheless, the performance is considerably better than that of the remaining algorithms.
The aforementioned cluster visualisations and the related discussion make it clear that both OPTICS and DBSCAN performed well, while K-modes, K-means and Spectral did not. To reach a quantifiable conclusion, however, we carry out further validation procedures in the following section.
Python's scikit-learn library [45] has been used for implementing both the clustering and the validation procedures.

E. CLUSTER VALIDATION
To gain objective insight into the algorithms' performance and to quantify the 'goodness' of the resulting clusters, an array of highly relevant validation measures has been applied. We have employed both Internal and External validation methods:

Internal:
Internal validation techniques evaluate how closely data points are positioned to each other inside the same cluster, known as 'Compactness' [46]. In addition, they determine how strongly a pair of data points within the same cluster is connected compared with neighbouring data points placed outside the cluster, often known as 'Connectedness'. Such validations do not require any prior ground truth or cluster labelling.
Clusters showing strong 'Compactness' and 'Connectedness' are considered well-formed.

External:
External validation techniques measure the extent to which cluster labels match externally supplied class labels [47]. Due to the custom-built nature of our dataset, we have the option of using external measures, as the class labels are available. However, except for the validation purposes outlined in this section, these class labels have not been used in any other process. The validation methods shown in Table 6 have been used:

E.1. Internal Validation:
In this section, we will examine, using several internal metrics, how well the clusters have been formed.

E.1.a. Davies-Bouldin Index (DBI):
The Davies-Bouldin Index (DBI) validates the algorithms on the basis of the ratio of within-cluster distances to between-cluster distances [48]; smaller values indicate better clustering. In this study we have employed the reverse of the Davies-Bouldin Index, i.e. (2 − DBI), since the integer part of the largest DBI reported is '2'. As this reverses the direction of the index, it provides more consistency with the other indices evaluated in this work without compromising the overall outcome. DBI can be measured for any value of n_cluster (nc) using (11). From Fig. 16 it can be observed that the metric considers DBSCAN and OPTICS to be almost equal performers, while K-modes and K-means are average at best, lying far behind OPTICS and DBSCAN in terms of cluster quality, followed by Spectral, which also performed poorly. This agrees with the knowledge we have attained so far through the scatterplot visualisation (section 7.4).
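The DBI and its (2 − DBI) reversal can be illustrated with scikit-learn; the tiny two-cluster dataset below is hypothetical:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

# Two tight, well-separated clusters give a small (good) DBI.
X = np.array([[0.0, 0.0], [0.1, 0.1], [0.0, 0.2],
              [4.0, 4.0], [4.1, 4.2], [4.2, 4.0]])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

dbi = davies_bouldin_score(X, labels)   # smaller is better
reversed_dbi = 2 - dbi                  # larger is better, as used here
```

The subtraction from 2 flips the direction of the index so that, like the other metrics in this section, higher values indicate better clusters.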

E.1.b. Calinski-Harabasz Index (CHI):
CHI compares the average between- and within-cluster sums of squares to evaluate cluster validity [50]; a higher value indicates better clustering. The index, CHk, can be defined as in (12) [51], where Xb is the between-cluster variance and Xw the within-cluster variance; k denotes the total number of clusters and N the total number of observations.
Results returned by this index can be rather large, as it has no upper bound. In this research we have confined the metric outcomes to between 0 and 1 to maintain conformity with the other indices. This alteration may have marginally reduced the differences among the algorithms, but the original trend is preserved.
The modification involves a minimal variation of the Sigmoid function, as depicted in (13). For each output y, a corresponding value k is obtained after squashing, such that {k : 0 < k < 1}. n is the length of the integer part of the highest value produced by the index.
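The exact form of (13) is a variation of the sigmoid; the version below (scaling the raw score by 10^n before the sigmoid) is our plausible reading of it and should be treated as an assumption, as are the toy data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

X = np.array([[0.0, 0.0], [0.1, 0.1], [0.0, 0.2],
              [4.0, 4.0], [4.1, 4.2], [4.2, 4.0]])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

chi = calinski_harabasz_score(X, labels)   # unbounded above; higher is better

# Assumed sigmoid-style squash into (0, 1): n is the number of digits in
# the integer part of the largest score observed, per the text above.
n = len(str(int(chi)))
k = 1.0 / (1.0 + np.exp(-chi / 10**n))
```

Because the scaled argument stays positive and bounded, the squash compresses large scores into (0.5, 1), which is why the text notes that differences between algorithms appear smaller after the transformation.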

FIGURE 17. VALIDATION RESULTS FOR CALINSKI-HARABASZ INDEX
The index (Fig. 17) indicates that the algorithms performed almost equally, with K-means and K-modes slightly ahead of the others. In reality, however, the gap between the algorithms (especially between DBSCAN/OPTICS and the rest) is more substantial than it appears in the figure, though the overall trend does not deviate much.

E.1.c. Silhouette Coefficient Score:
The Silhouette Coefficient is a widely accepted validation technique. The Silhouette Coefficient score, s, can be obtained for each sample from the intra-cluster distance a (the mean within-cluster distance) and the mean nearest-cluster distance b, using (14) [52], where b denotes the mean distance between a sample and the closest cluster to which the sample does not belong.
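Equation (14), s = (b − a) / max(a, b), can be worked through by hand for a single sample; the four-point dataset below is hypothetical:

```python
import numpy as np

# Two tiny clusters; compute s for sample x = X[0] per eq. (14).
X = np.array([[0.0, 0.0], [0.0, 0.2], [4.0, 4.0], [4.0, 4.2]])
labels = np.array([0, 0, 1, 1])

x = X[0]
# a: mean distance to the other members of x's own cluster
a = np.mean([np.linalg.norm(x - p)
             for p in X[labels == 0] if not np.array_equal(p, x)])
# b: mean distance to the members of the nearest other cluster
b = np.mean([np.linalg.norm(x - p) for p in X[labels == 1]])

s = (b - a) / max(a, b)
```

Since the within-cluster distance a is far smaller than the nearest-cluster distance b, s lands close to +1, the signature of a dense, well-separated cluster.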

FIGURE 18. VALIDATION RESULTS FOR SILHOUETTE COEFFICIENT SCORE
The Silhouette Coefficient Score, as shown in Fig. 18, projects a picture rather contradictory to that of the scatterplots: K-means and K-modes are identified as the top-performing algorithms, whereas OPTICS and DBSCAN produce disproportionately substandard results. Figure 19 tabulates a summary of the internal validation outcomes, adding up the positions of each algorithm across the validation charts (Fig. 16 - Fig. 18). We can observe that K-means and K-modes generally achieved a slightly better outcome than OPTICS, while DBSCAN and Spectral lag further behind. Clusters produced by Spectral were judged as below par by all the internal validation metrics.

FIGURE 19. A SUMMARY OF INTERNAL VALIDATION OUTCOMES
Now, to have a complete picture, a range of External validation techniques will be explored.

E.2. External Validation:
In this section, we will examine how well the clusters have been formed using external validation metrics.

E.2.a. Adjusted Rand Index (ARI):
The Rand Index (RI) determines a similarity score between two clusterings by considering every pair of samples and counting the pairs that are assigned to the same or to different clusters in both the true and the predicted clusterings [53]. The raw RI score is then 'adjusted for chance' into the ARI using (15). Scores towards +1 indicate better clustering. The index, as depicted in Fig. 20, clearly projects the superior quality of the clusters produced by OPTICS and DBSCAN, whereas the rest, according to the ARI, did not show any commendable outcome.
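A quick illustration of the 'adjusted for chance' behaviour, using hypothetical label vectors: the ARI ignores cluster names (a relabelled but identical partition still scores 1), while a near-random partition is pulled down to around zero.

```python
from sklearn.metrics import adjusted_rand_score

true_labels = [0, 0, 0, 1, 1, 1]        # ground-truth ham/spam labels
pred_perfect = [1, 1, 1, 0, 0, 0]       # same partition, names swapped
pred_random = [0, 1, 0, 1, 0, 1]        # no relation to the truth

# Permutation-invariance: relabelled-but-identical partitions score 1.
assert adjusted_rand_score(true_labels, pred_perfect) == 1.0
ari_rand = adjusted_rand_score(true_labels, pred_random)
```

The chance adjustment is what allows the ARI to go negative for partitions worse than random agreement, unlike the raw RI.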

E.2.b. Adjusted Mutual Information (AMI):
The AMI metric quantifies the extent of information the two clusterings under examination have in common, and is often referred to as a 'Correlation Measure'. The MI score is 'adjusted for chance' to obtain the AMI [54]. The AMI of two clusterings C1 and C2 is found using (16), where T is the Entropy. Scores approaching +1 indicate better clustering.

E.2.c. V-measure:
V-measure relies on the principle of conditional entropy analysis, which measures the extent of disorder within a cluster. V-measure computes the Harmonic Mean of Completeness (whether all members of the class in question are allocated to the same cluster) and Homogeneity (whether a cluster contains only members of a specific class) [55]. Homogeneity and Completeness are two critical characteristics of a cluster. The V-measure, v, is given in (17), where β signifies the degree of weighting given to each of the two characteristics; in this case β = 1 (equal weighting):

v = ((1 + β) × homogeneity × completeness) / ((β × homogeneity) + completeness) … (17)
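With β = 1, equation (17) reduces to the plain harmonic mean of homogeneity and completeness, which can be checked directly against scikit-learn (the label vectors below are hypothetical):

```python
from sklearn.metrics import (homogeneity_score, completeness_score,
                             v_measure_score)

true_labels = [0, 0, 0, 1, 1, 1]
pred_labels = [0, 0, 1, 1, 1, 1]

h = homogeneity_score(true_labels, pred_labels)
c = completeness_score(true_labels, pred_labels)

# With beta = 1, eq. (17) is the harmonic mean of h and c.
v = 2 * h * c / (h + c)
```

`v_measure_score` computes the same quantity in one call, so the hand-computed v should match it exactly.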

FIGURE 22. VALIDATION RESULTS FOR V-MEASURE
The results of the V-measure shown in Fig. 22 demonstrate a pattern similar to AMI, with OPTICS and DBSCAN producing better results than the rest.

E.2.d. Fowlkes-Mallows Index (FMI):
Another widely accepted external validation metric is the Fowlkes-Mallows Index (FMI). If α is the proportion of pairs predicted to be in the same cluster that truly belong together (pairwise precision), and β is the proportion of pairs that truly belong together which are correctly allocated to the same cluster (pairwise recall), then their geometric mean, µ = √(α × β), gives the FMI [56].
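The geometric-mean behaviour can be seen with scikit-learn on hypothetical labels; a perfect partition scores exactly 1:

```python
from sklearn.metrics import fowlkes_mallows_score

true_labels = [0, 0, 0, 1, 1, 1]
pred_labels = [0, 0, 1, 1, 1, 1]

# Geometric mean of pairwise precision and recall over all sample pairs.
fmi = fowlkes_mallows_score(true_labels, pred_labels)
assert fowlkes_mallows_score(true_labels, true_labels) == 1.0
```

Because both α and β are counted over pairs rather than individual points, a single misassigned point can shift several pairs at once, making the FMI sensitive to small cluster errors.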

FIGURE 23. VALIDATION RESULTS FOR FOWLKES-MALLOWS INDEX
FMI validates OPTICS and DBSCAN as the top performers, in line with the previous indices. However, the FMI also indicates that the clustering quality of the other algorithms is not much lower than that of OPTICS and DBSCAN.

E.2.e. Purity:
Purity is a transparent and straightforward external validation metric. Often regarded as 'Cluster Accuracy', it has gained widespread acceptance as an indicator of the quality of the clusters produced by clustering algorithms. Scores approaching +1 suggest better clustering [57]. Fig. 21 charts the purity measure for each algorithm. Figure 25 summarises the findings of the external performance metrics used in this research.
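Purity has no dedicated scikit-learn function, but a common formulation credits each cluster with its majority class via the contingency matrix; a minimal sketch (helper name ours):

```python
from sklearn.metrics.cluster import contingency_matrix

def purity(true_labels, pred_labels):
    """Purity ('Cluster Accuracy'): each cluster is credited with its
    majority class; the correctly accounted-for points are summed and
    divided by the total number of points N."""
    cm = contingency_matrix(true_labels, pred_labels)
    return cm.max(axis=0).sum() / cm.sum()
```

Note that purity alone rewards many tiny clusters (each trivially pure), which is why it is read here alongside the chance-adjusted indices above.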

FIGURE 25. SUMMARISED VIEW OF EXTERNAL VALIDATION OUTCOMES
OPTICS achieved the top performance in all the external evaluation measures, with DBSCAN a close second. The other three algorithms produced clusters of lesser quality. In fact, the trend depicted by the external metrics is the same across all five evaluation techniques.

E.3. Discussion on the Internal and External Validation Outcomes
Fig. 26 depicts the cumulative positions of each of the clustering algorithms, from the Davies-Bouldin Index through to the Purity metric. It clearly shows that OPTICS had uncontested clustering 'goodness' for most of the tests, with DBSCAN relatively close behind. The remaining three failed to provide performance comparable to the top two, though K-modes seemed occasionally promising. Averaged over all validation metrics, the clusters delivered by OPTICS performed approximately 0.26% better than those of DBSCAN. It can be concluded that both OPTICS and DBSCAN produced meaningful clusters that align strongly with the aim of this research. The Heatmap in Fig. 27 presents the clusters created in terms of all the evaluation metrics in an easy-to-comprehend visualisation.

E.4. Evaluation based on Class Detection Rate
The validation techniques and scatterplots discussed so far impart a convincing view of the clustering performance of the algorithms. In this section, a more granular approach is taken to investigate how well each cluster has been formed, independently of the others, through metrics such as Accuracy and F-Score [58]. Owing to the availability of the original class labels, we can further evaluate the clusters using these two metrics. Figure 28 shows the Balanced Accuracy achieved by each of the algorithms, where OPTICS and DBSCAN demonstrated acceptable performance; this overall evaluation confirms the findings of the previous validation procedures. Figure 29 reports the F-Scores for ham and spam emails separately. It is evident that OPTICS and DBSCAN performed better than the other algorithms. In addition, the F-Scores for spam emails are higher (though not especially impressive in most instances) than those for ham in all cases except K-modes, where the F-Score for ham is slightly higher. Generally, F-Scores are a good indicator of how precise and robust the individual clusters (classifiers, in supervised models) are. The critical takeaway from this section is that, although the proposition delivers encouraging results and interesting findings, there is still room for improvement in clustering quality, especially for ham.
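Balanced Accuracy and per-class F-Scores can be computed directly once the original class labels are mapped onto the clusters; the label vectors below are hypothetical (0 = ham, 1 = spam):

```python
from sklearn.metrics import balanced_accuracy_score, f1_score

# Hypothetical labels; the real dataset has 22,000 entries.
true_labels = [0, 0, 0, 0, 1, 1, 1, 1]
pred_labels = [0, 0, 0, 1, 1, 1, 1, 1]

# Balanced accuracy: mean of per-class recalls, robust to class imbalance.
bal_acc = balanced_accuracy_score(true_labels, pred_labels)

# F-Scores reported separately for spam and ham, as in Fig. 29.
f1_spam = f1_score(true_labels, pred_labels, pos_label=1)
f1_ham = f1_score(true_labels, pred_labels, pos_label=0)
```

Reporting the two F-Scores separately is what exposes the asymmetry noted above: a single ham message mislabelled as spam lowers the ham score while leaving spam recall untouched.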

E.5. A Comparison with Closely Related Studies
As tabulated in Table 7, our research has been compared with some of the most closely related works that have used, at least partially, the concept of unsupervised learning for the segregation of ham and spam emails. Note that, as mentioned before, works largely similar to ours are virtually non-existent. As described above, DBSCAN and OPTICS demonstrated strong performance; however, we found almost no relevant research investigating any mainstream clustering algorithm other than K-means. It is therefore difficult to draw a decisive conclusion from the comparison.

VIII. CONCLUSION AND FUTURE WORK
This research described a novel framework, based entirely on unsupervised methodologies, for separating ham from spam emails through clustering. The process started with the formation of a raw dataset of several features of varying characteristics, derived from the email's body content and subject header. Some of these features have not been used in earlier research with similar aims: some were completely novel, engineered after carefully examining the content from various angles, while others have been used in earlier research in a different form. The dataset was then converted into binary form, and feature selection algorithms (MCFS and Laplacian) were applied to remove low-impact features where possible. The feature that contributed the least was identified and consequently left out. The resulting dataset (containing both ham and spam emails) is now publicly available for download from GitHub [62]; we believe it can be a useful source for other relevant research.
Afterwards, a range of unsupervised algorithms (K-means, K-modes, Spectral, DBSCAN and OPTICS) was used to cluster the dataset into groups of ham and spam emails. A number of internal and external validation processes were then applied to measure and quantify the quality and usefulness of the clusters. The findings show that OPTICS and DBSCAN produce the best-quality clusters, whereas the other algorithms, though some displayed sporadic promise, were not optimal. The clusters were further evaluated through metrics such as the F-Score.
Differentiating ham from spam emails using only unsupervised methods, acting upon the email's subject field and body content, is a novel approach. A range of novel features has been engineered, and sound feature selection methods have been applied so that only features with a sufficient degree of impact are used. Overall, a number of novel avenues have been explored to address significant research gaps in this field of study.
In future endeavours, we aim to further refine the novel features to improve the quality of clustering, especially for ham. We also intend to combine the framework presented here with our earlier work on unsupervised clustering using header fields only. This will provide a completely unsupervised system that can be expected to increase detection rates and clustering quality using all the major parts of an email. The best-performing clustering algorithms from this study and our earlier studies will be used in future work, and the results will be validated using all standard evaluation processes available. Our datasets will also be made publicly downloadable for further use.

APPENDIX C
An extract of spam words and phrases from [27].