Efficient Clustering of Emails Into Spam and Ham: The Foundational Study of a Comprehensive Unsupervised Framework

The spread and adoption of spam emails in malicious activities such as information and identity theft, malware propagation, and monetary and reputational damage are on the rise, with increased effectiveness and diversification. Without doubt, these criminal acts endanger the privacy of many users and businesses. Several research initiatives have addressed the issue, with no complete solution until now, and we believe an intelligent and automated methodology is the way forward to tackle the challenges. However, to date only limited studies have been conducted on the application of purely unsupervised frameworks and algorithms to the problem. To explore and investigate the possibilities, we propose an anti-spam framework that relies fully on unsupervised methodologies through a multi-algorithm clustering approach. This article presents an in-depth analysis of the methodologies of the first component of the framework, examining only the domain- and header-related information found in email headers. A novel method of feature reduction using an ensemble of 'unsupervised' feature selection algorithms has also been investigated in this study. In addition, a comprehensive novel dataset of 100,000 records of ham and spam emails has been developed and used as the data source. Key findings are summarized as follows: I) out of six different clustering algorithms used, Spectral and K-means demonstrated acceptable performance, while OPTICS produced the optimum clustering with an average of 3.5% better efficiency than Spectral and K-means, validated through a range of validation processes; II) the other three algorithms, BIRCH, HDBSCAN and K-modes, did not fare well enough; III) the average balanced accuracy for the optimum three algorithms has been found to be ≈94.91%; and IV) the proposed feature reduction framework achieved its goal with high confidence.


I. INTRODUCTION
Email spamming can be defined as the act of distributing unsolicited messages, oftentimes sent in bulk, using email. Emails sent for legitimate purposes are known as ham [1]. Spammers use spamming not only for marketing purposes but also to achieve more malicious goals, such as reputational damage and financial disruption, on both the institutional and the personal front. Email is still considered the primary choice for scammers when it comes to delivering malware. Financial gain is one of the main motivations for the spammers. It is estimated that spammers may earn around USD 3.5 million yearly from spamming [2].
The associate editor coordinating the review of this manuscript and approving it for publication was Thomas Canhao Xu.
By the end of 2019, there were nearly 4 billion active email accounts worldwide [3]. In 2019, approximately 294 billion emails were exchanged daily, 50% of which were spam [4]. Needless to say, this substantial volume of spam circulating through a public network like the internet continually leaves a damaging and costly footprint on communication bandwidth, available memory on email servers and CPU cycles, in addition to the time and patience of millions of everyday users who have to deal with these emails. In a recent report, the FBI stated that malicious spamming incurred financial damage of USD 12.5 billion to business email consumers in 2018 [5]. The United States is generally known to be the largest source of spam emails; however, in recent times other countries have often outnumbered the USA in originating spam. As of April 2019, Russia and Brazil had surpassed the USA and China (another notable spam-producing country), producing approximately 14% and 16% of the total volume of worldwide spam respectively [6]. Though legislation such as CAN-SPAM (Controlling the Assault of Non-Solicited Pornography and Marketing Act) was introduced to protect users, it did not achieve the expected deterrent effect on spammers [7]. The world's top 70% of spam gangs, responsible for coordinated worldwide spamming, have their roots in the USA [2]. Meanwhile in Oceania, recent reports indicate that Australian businesses and consumers had already lost nearly AUD 28,375,373 to email fraud by the end of 2019 [8].
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
The contributions of this research can be summarized as: I) developing a comprehensive novel database from multiple email corpuses that may be universally used for any number of related studies; II) investigating the effect of unsupervised clustering of only the domain and header information of both ham and spam emails, excluding the subject header; III) employing a robust unsupervised feature reduction algorithm for dimensionality reduction; and IV) laying the pathway for subsequent research and development of the complete framework.

A. SCOPE AND MOTIVATION OF THE RESEARCH
A number of research initiatives in this field using supervised, and a combination of supervised and unsupervised machine learning approaches have already been undertaken and some of those are quite extensive. However, until now, very limited work has been carried out based solely on unsupervised methodologies. Moreover, the type of analysis, mostly supervised, has been found to be revolving around the subject and content of the email, but header and domain information, as well as the presence of underlying scripts within the emails have not been investigated to any great depth, especially using purely unsupervised methods. However, as we know, the problem domain of ham and spam email is not constrained only within a single facet of the email subsystem, but rather all the sub-parts need evaluation and investigation. This research of ours is a comprehensive attempt at resolving this vacuum. Collecting and managing labelled data for supervised learning algorithms is often quite complex and expensive [9], whereas unsupervised clustering can work with unlabeled data. Though it is vitally important to dissect all the parts of an email including domain, header and content while building anti-spam models [10], this study will look into the effect of unsupervised clustering of only the domain and header information of both ham and spam emails excluding the subject header.
Additionally, a key gap in many of the existing works that we have inspected is that they do not clearly state why certain features have been preferred over others, or why some features have been left out altogether. We wanted to avoid such a dilemma and thus came up with a novel unsupervised feature reduction mechanism that can confidently indicate the non-significant feature set that can be left out of the analysis. Some of the other studies did in fact use feature selection processes, but we believe the technique developed by us provides increased confidence in the usefulness and impact of the significant features.
In future studies, the content, scripting information and the subject header will also be thoroughly examined under the umbrella of purely unsupervised methodologies. The knowledge gained from this study and the subsequent ones will be hybridised to generate the complete spam filtering solution. This study is a critical and novel undertaking as we have found no previous studies to exactly match the objectives of this research.

II. RELEVANT STUDIES
Research initiatives in the field of unsupervised clustering of emails into spam and ham purely using header and domain information are rather scant. Despite that, the following section sheds light on some of the closely related works.
The framework introduced by Smadi et al. [11], named 'Phishing Email Detection System (PEDS)', uses unsupervised clustering in conjunction with both supervised and reinforcement learning techniques [12]. Such an amalgamation equips the system with an enhanced capability to modify itself based on identified changes in the environment. The proposed model analyzes a number of header and domain features such as MessageIDs, Sender's Domain, the email's content class, whether the email is multi-part, the number of receivers and attachments, the reply address etc. The system mainly aims at tackling Zero-Day Phishing attacks [13]. Based on the environmental parameters, the heart of the system, the 'Feature Evaluation and Reduction (FEaR)' algorithm, can rank and select the critical features from emails dynamically. FEaR is based on the Regression Tree (RT) algorithm. Immediately after the execution of FEaR, another novel algorithm, DENNuRL (Dynamic Evolving Neural Network using Reinforcement Learning), lets the core Three-Layer Neural Network of PEDS evolve dynamically and stitch together an optimum Neural Network. Though the achieved accuracy rate is 99.05%, some of the features employed, such as 'NumLinkNonASCII', 'BodyDearWord', 'BodyNumChars', 'ContainScript' and 'BodyNumWords', are rather unconventional, and the degree of impact of these features has not been discussed. The authors have not presented the logic behind their inclusion, leaving scope for further analysis. Some critical domain features, such as source IP and age of domain, have also been left out.
A critical issue in literature and historic documents that has gained much attention is the determination or attribution of 'Authorship'. Researchers have developed strategies, such as identifying patterns pertaining to stylistic, syntactic and grammatical features available in such documents, to successfully identify and group the original authors of such documents. Despite emails being highly unstructured, Alazab et al. [14] attempted to introduce the idea of 'Authorship Attribution' for phishing campaign identification. The authors deployed an Unsupervised Automated Natural Cluster Ensemble (NUANCE) methodology to achieve approximate clustering of spam emails. The researchers used 57 stylistic features, for instance the total word count of the text of the email, the total count of punctuation marks used in the email body, the total number of contractions present in the email, the total number of URLs present in the body of the email, the total number of obfuscated words present in the email etc. Semantic features, along with their combination with stylistic features, have also been considered. Semantic features aid in explaining how words that share certain features may be members of the same domain. The eventual clusters are obtained by hierarchically clustering the approximate sets, churning out 27 different clusters. Even though the system is quite impressive and provides improved results in the general direction of 'authorship attribution' in spam campaign detection, the intra-dynamics taking place within different campaign groups (for instance spammers interchanging or borrowing different functionalities from each other) may often pass undetected. Though the work mainly focusses on elements found in the 'Authorship Attribution' process, such a research attempt may be improved by the addition of header and domain features.
In [15], the authors developed a clustering solution to detect spam based on spamming campaigns. They used the FP-Tree (Frequent Pattern Tree) algorithm [16] to identify spam campaigns and chose OrientDB (a NoSQL database) to store the campaign spams. Features were extracted from these emails so that the FP-Tree could be built based on the frequency of occurrence of each of the features. Emails are then clustered into spam campaigns based on the similarity of the extracted features. Several header features, for instance content type, character set, subject etc., have been used. Features from other parts of an email have also been put to use. However, FP-Tree, used in this research for clustering purposes using different features of an email, is extra sensitive to even minor changes in layout or feature structure. Such minimal changes will cluster spam emails from a similar campaign into two different campaigns.
In a technical report, Blanzieri and Bryl [17] discussed several aspects of different learning algorithms aimed at spam filtering. The paper highlighted a number of proposals to alter or modify email transmission protocols in a view to encompass techniques to reduce spam emails as much as possible. Some methods focused solely on message content while others combined header or subject with content.
Al-Saaidah in his thesis [18] combined K-means clustering with other classification methodologies to increase the detection accuracy of phishing attacks [34]. Various header and domain features have been used, such as the subject header, the To, From and Reply domains, as well as the presence of suspicious JavaScript. The results showed that the combination of classification and clustering provides a slightly better detection result than a standalone model, attaining an accuracy of 98.37%. The project also used automated and manual feature selection techniques.
A limited amount of work has also been done on clustering spam and ham emails based on algorithms related to Artificial Immune Systems [19], such as Negative Selection Algorithm [20] and even custom-developed Genetic algorithms [21].
Under the domain of unsupervised methodologies, 'self-supervised learning' is crucial, and has been widely used in research domains based on computer vision as well as for anomaly detection in several fields. Self-supervised learning formulates new surrogate labels artificially, or extracts a robust feature set, through the characteristics or structure of the unlabeled data itself [66]. However, studies on effective ham and spam email differentiation using self-supervised learning are rather scant. We did find that several flavours of 'Autoencoders' (a class of Artificial Neural Network that generates an efficient feature representation of the input dataset in an unsupervised fashion [67]) have been used in some studies in a self-supervised fashion, functioning as part of an overall proposition that is primarily supervised.
Mi et al. [68] in their findings have shown that Autoencoders used in a stacked ensemble can provide greater computational ability, better feature reconstruction and higher accuracy while detecting spam emails than other traditional supervised algorithms such as Support Vector Machine, Naïve Bayes, Random Forest, Multilayer Perceptron and a few others. Their study is based on the commonly available PU spam corpora [69]. In another study, an improved variant of Stacked Autoencoders has been used to analyze spam emails in Chinese, yielding better performance than traditional methods [68]. Additionally, in their research [70], Douzi et al. used Autoencoders to automatically learn the hidden feature representation of URL(s) embedded within the email content in an unsupervised manner, with a view to determining the possibility of phishing scams. The learnt feature representation can then be used as an input to supervised classifiers. Though the idea seems promising, no actual experimental results have been reported by the authors.
Martino et al. [75] in their paper proposed a content-based multiclass spam email identification framework. Unsupervised hierarchical clustering along with some supervised classifiers in combination with TF-IDF have been used to develop the model. TF-IDF combined with SVM performed the best, with an accuracy of 95.39%. However, the dataset used for training is heavily skewed towards one particular class, thus the testing needs to be done using a more balanced dataset.
Fragos [76] performed K-means clustering to achieve a 2-way classification (ham and spam emails). PCA was first used to lower the dimensionality of the data before the clustering steps. The results show a 'Recall' measure of 94.91% and 98.57% in detecting ham and spam emails respectively. However, a key limitation is that the solution does not produce a consistent result in each run, and the authors have not clarified why some of the features were left out while others were selected for the study.
The approach that we have taken is unique and more comprehensive especially considering the above studies.

III. OUR APPROACH AT A GLANCE
As the name suggests, unsupervised learning refers to the fact that the model will not have any labelled data to work with, and thus no training will be provided; supervised approaches, by contrast, have the downside of training over large email corpora that are tagged manually and at considerable cost [27]. Based on the dataset, unsupervised algorithms generally attempt to figure out common features within a group of items and rearrange the data points into clusters based on that commonality [22]. Unsupervised learning is also computationally efficient and less time consuming [23], [24] than supervised approaches. Apart from the usual 'distance based' clustering [25], where a certain distance metric such as 'Euclidean Distance' [26] is applied to determine the similarity between data objects, 'density based' clustering is also useful in certain domains.
As can be observed from Fig. 1, our approach comprises several steps. At step I (Fig. 1), we sampled our preprocessed dataset X of 100,000 records and 10 features, which stores spam and ham emails in the ratio of 2:1. Note that this novel dataset (X) is completely custom built from three other publicly available datasets; it is discussed briefly in section IV. The sampled dataset S houses 50,000 records (67% spam and 33% ham) such that S ⊂ X, with all 10 original features intact.
The process of feature reduction is initiated at step II (Fig. 1), where we propose a different approach to selecting the most impactful features, which will be discussed in due course.
Step III (Fig. 1) deals with restructuring original dataset to reflect the output of the previous step; in addition, the dataset was broken down into two separate parts (containing 60% and 40% of the data respectively) and used in two different runs for clustering purposes. Afterwards, different clustering algorithms were applied in step IV (Fig. 1) on one of the subsets holding 60% of the data. Not all unsupervised algorithms can be tuned to produce exactly two clusters, so we had to select only those which can be parameterized to do so.
Step V (Fig. 1) validates the clusters obtained in step IV. We applied a range of internal and external measures to the produced clusters to confidently quantify the performance and true detection rate of the algorithms.
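As an illustration of one external measure of the kind used in step V, the sketch below computes balanced accuracy for a two-cluster output against known spam/ham labels. It is a minimal example under stated assumptions and not the validation code of this study: the function name `balanced_accuracy`, the toy labels, and the rule of trying both cluster-to-class assignments and keeping the better one are all hypothetical choices made for illustration.

```python
def balanced_accuracy(true_labels, cluster_ids):
    """Balanced accuracy of a 2-cluster result against 0/1 ground truth.

    true_labels: iterable of 0 (ham) / 1 (spam).
    cluster_ids: iterable of 0 / 1 cluster indices (names are arbitrary).
    Because cluster names carry no meaning, both mappings are tried
    and the better one is reported (an assumption of this sketch).
    """
    true_labels = list(true_labels)
    cluster_ids = list(cluster_ids)
    if len(true_labels) != len(cluster_ids):
        raise ValueError("inputs must have the same length")

    def score(pred):
        tp = sum(1 for t, p in zip(true_labels, pred) if t == 1 and p == 1)
        tn = sum(1 for t, p in zip(true_labels, pred) if t == 0 and p == 0)
        pos = sum(1 for t in true_labels if t == 1)
        neg = len(true_labels) - pos
        if pos == 0 or neg == 0:
            raise ValueError("both classes must be present")
        # Mean of the true positive rate and the true negative rate.
        return 0.5 * (tp / pos + tn / neg)

    flipped = [1 - c for c in cluster_ids]
    return max(score(cluster_ids), score(flipped))


truth = [1, 1, 1, 1, 0, 0]        # toy ground truth: four spam, two ham
clusters = [0, 0, 0, 1, 1, 1]     # toy clustering output
print(balanced_accuracy(truth, clusters))
```

Averaging the per-class rates keeps the 2:1 spam-to-ham imbalance of the dataset from inflating the score, which is the usual reason for preferring balanced accuracy over plain accuracy.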
At this stage, in step VI (Fig. 1), we were in a position to identify the top performing algorithm(s). Finally, at the last step, we repeated steps IV-VI to evaluate the findings obtained in the previous step, but with the other dataset holding the remaining 40% data. The result confirms whether the best performing algorithms found earlier in the first run indeed consistently perform on email header features in clustering emails into spam and ham.

IV. THE DATASET
Though there are a number of pre-processed, publicly available datasets on ham and spam emails, we had a few criteria to begin with that needed to be fulfilled: 1) a dataset containing both email content and all the common header fields; 2) a sufficiently large dataset, of around 100,000 records, so that a more realistic performance measure and nature of clustering can be obtained; and 3) an email dataset that is not confined within a specific geographical zone.
Unfortunately, the public datasets that are already in a ready-made state, such as LingSpam, Hunter SpamBase, SpamAssassin and PUA to name a few, did not fulfill the above criteria, as those were either not of the expected volume, lacked sufficient information relating to header and content, or were linked to a few specific geographical areas. Thus we were left with no choice other than to build such a dataset on our own from the publicly available raw and non-curated email corpuses (available as text files). We did not use any of our own email records for this research, nor do we plan to use them in the subsequent development of our proposition.
The seminal database of over half a million records was first created from three publicly available email collections (2017 and 2018 spam collection by Bruce Guenter [30], TREC [31] and Enron dataset [32]), containing both ham and spam emails. These archives store emails in sufficient volumes in textual format, including headers. This seminal database is pivotal to the whole framework as the subsequent investigations after this research will also use this raw data source for the formation of the required pre-processed datasets. The pre-processed one used in this study, has also been created from the above raw database and has 10 features and randomly selected 100,000 records of which 67,000 are spam emails and the rest is ham. A portion of this dataset can be seen in Table 1.
The WHOIS information repository has also been frequented quite heavily to populate the seminal database for certain domain-based features. We have extensively used Python 3.6 to code the data extraction and engineering algorithms, as well as for feature reduction, clustering and validation purposes. Some R packages have been utilized for visualisation. MySQL has been deployed at the backend for data storage.
The header field 'Subject' has been left out in this research as we felt it is more suitable to be coupled with the content of the email, and should best be earmarked for a separate research initiative focusing on content analysis. The below discussion briefly highlights the features used.
1. Diff_FromDomActDom: Indicates whether the domain contained in the 'From' field is different from the actual originating domain. The latter is extracted from the earliest 'Received' field, unless that field contains values such as 'localhost' or 'localdomain'; in that case the domain mentioned in the second-earliest 'Received' field is extracted.
2. BL_IP: Identifies whether the source IP of the originating domain has been 'Blacklisted' by a number of reputable spam reporting watchdogs such as spamhaus, barracuda, sorbs etc.
3. Created_within_1_year: Indicates whether the originating domain has been set up within the previous 12 months from the date of the email. The mail date can be extracted from the header.

4. Expire_in_13_months: Points out whether the originating domain will expire in 13 months from its creation date. It has been reported in some experiments that spammers most commonly register spamming domains anywhere between five to around a year [28], [29].
If e is the set describing possible values for the above four features then e = {0 (false), 1(true)} as these features are Boolean in nature (stored as integers, 0 and 1).

Mail_dt: Date the email was sent which, as mentioned before, can be extracted from the email headers.
10. Hop: A count of how many mail exchange(s) the email had to pass through before reaching the destination. The type of value for this feature is integer.
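To make the nature of these header and domain features concrete, the following sketch derives three of them from a raw message and two WHOIS dates using only the Python standard library. It is illustrative and not the extraction code used to build the dataset: the helper names, the approximation of "hops" by the number of 'Received' headers, the whole-month date arithmetic (with the day clamped to 28 for simplicity) and the sample message are all assumptions.

```python
from datetime import date
from email import message_from_string
from email.utils import parsedate_to_datetime


def add_months(d, months):
    """Shift a date by whole months; the day is clamped to 28 for simplicity."""
    total = d.year * 12 + (d.month - 1) + months
    return date(total // 12, total % 12 + 1, min(d.day, 28))


def created_within_1_year(created, mail_dt):
    """1 if the domain was registered in the 12 months before the email date."""
    return int(add_months(mail_dt, -12) <= created <= mail_dt)


def expire_in_13_months(created, expires):
    """1 if the domain expires no later than 13 months after its creation."""
    return int(expires <= add_months(created, 13))


def hop_count(raw_email):
    """Count 'Received' headers as a proxy for the mail exchanges traversed."""
    msg = message_from_string(raw_email)
    return len(msg.get_all("Received", []))


def mail_date(raw_email):
    """Parse the 'Date' header into a calendar date."""
    msg = message_from_string(raw_email)
    return parsedate_to_datetime(msg["Date"]).date()


# Hypothetical message, written only for this example.
RAW = (
    "Received: from mx2.example.net by mx1.example.org; Mon, 4 Mar 2019 10:00:02 +0000\n"
    "Received: from localhost by mx2.example.net; Mon, 4 Mar 2019 10:00:01 +0000\n"
    "From: sender@example.com\n"
    "Date: Mon, 4 Mar 2019 10:00:00 +0000\n"
    "\n"
    "body\n"
)

sent = mail_date(RAW)
print(hop_count(RAW), sent, created_within_1_year(date(2018, 9, 1), sent))
```

The Boolean helpers return integers (0 or 1) to match the way these features are described as being stored.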
There are some header features, for instance, 'return_path', that may or may not have an impact, but we had to leave those out as only the features common to all emails generated by any mail server or email client have been included in this study.
Henceforth all the features, starting from Diff_FromDomActDom through to the last one, Hop, will sequentially be referred to as $f_1, f_2, \ldots, f_{10}$. Appendix A has more on data construction.

V. THE COMPLETE WORKFLOW IN DETAIL
In this section, each of the subprocesses of the complete framework is discussed in detail.

A. SAMPLING
In reference to Step I of Fig. 2, the sample dataset S has been formulated from the complete custom-built numeric dataset of 100,000 records with 10 features as discussed above, 67% of which are spam emails and the rest ham. The feature vector and the ratio of spam to ham in S are the same as in the original dataset X. The purpose of sampling is to carry out feature reduction through multiple unsupervised feature selection algorithms. Executing the feature selection algorithms on the full dataset would require matrix manipulation over (100,000 × 10) data points, which is rather infeasible and unscalable given the hardware required for such computation.
However, computation on a 50% sample of X, that is S, still requires considerable hardware capacity. To address that we deployed one of Amazon's AWS servers.
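A class-preserving 50% sample of the kind described above can be sketched as follows. This is a minimal illustration using only the standard library, not the sampling routine used in the study: the function name `stratified_sample`, the fixed seed and the synthetic label list standing in for X are assumptions.

```python
import random


def stratified_sample(labels, fraction, seed=0):
    """Return sorted row indices of a sample that keeps each class's share."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    chosen = []
    for indices in by_class.values():
        # Draw the same fraction from every class, without replacement.
        k = round(len(indices) * fraction)
        chosen.extend(rng.sample(indices, k))
    return sorted(chosen)


# Synthetic stand-in for X: 67% spam (1) and 33% ham (0), as in the text.
labels_X = [1] * 67_000 + [0] * 33_000
sample_idx = stratified_sample(labels_X, fraction=0.5)
spam_share = sum(labels_X[i] for i in sample_idx) / len(sample_idx)
print(len(sample_idx), round(spam_share, 2))
```

Sampling each class separately is what guarantees that the spam-to-ham ratio of S matches that of X exactly, rather than only in expectation.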

B. FEATURE REDUCTION THROUGH FEATURE ELIMINATION
With a large number of features, or attributes, model construction often becomes problematic due to several issues such as the Curse of Dimensionality [33], extended training time, overfitting etc. Feature reduction/selection tries to overcome these issues by logically selecting only those features which will have the most determining effect on the final output. Our feature vector, however, owing to effective pre-processing and feature engineering, was already in a rather manageable state with only 10 features. Nevertheless, we have sought to reduce it further by employing a novel ensemble feature reduction process, pictorially illustrated in Step II of Fig. 2.
The process deploys three unsupervised feature selection algorithms: Principal Component Analysis (PCA), Laplacian Score for Feature Selection and Multi-Cluster-based Feature Selection (MCFS). There were a few other options, but we found those unscalable to reasonably large datasets. Before initiating the discussion on the proposed feature reduction technique, we briefly discuss the above-mentioned three algorithms.
Principal Component Analysis (PCA): PCA is an unsupervised framework that works extremely well in most cases for 'Dimensionality Reduction', in such a fashion that the maximum variation of the dataset is retained [35]. PCA is also a valuable tool in building Predictive Models. The system is an 'Orthogonal Linear Transformation' that transmutes the normalised input data to a new coordinate system [36]. To begin with, the labels are stripped off and the dataset is put into a matrix X. The mean, $\bar{X}$, is then calculated. Now the dataset is considered in its original form without stripping off the labels, and the 'Covariance Matrix' is calculated using (1), which shows the covariance of two variables P and Q and can be used to calculate the covariance matrix of X.

$$\mathrm{Cov}(P, Q) = \frac{1}{n-1} \sum_{i=1}^{n} (P_i - \bar{P})(Q_i - \bar{Q}) \tag{1}$$

In (1), n is the number of observations, and $\bar{P}$ and $\bar{Q}$ are the means of P and Q. The eigenvectors and their corresponding eigenvalues are then calculated from the covariance matrix of the original dataset. The eigenvectors are sorted by descending eigenvalue and the k largest are taken, where k is the number of dimensions of the newly obtained feature subspace and $k \le d$, with d being the number of dimensions of the original feature space. Subsequently, the projection matrix is fashioned from the k eigenvectors, through which the original dataset X is transformed to obtain a new k-dimensional feature subspace. In this research, PCA has been used before MCFS and Laplacian Score to obtain the cardinality of the feature set holding the least impactful features.
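The PCA steps outlined above (centring, covariance matrix, eigen-decomposition, sorting by eigenvalue and projection onto the k leading eigenvectors) can be sketched with NumPy as below. This is a generic illustration rather than the implementation used in this study: the function name `pca`, the random toy matrix and the decision to centre without further scaling are assumptions.

```python
import numpy as np


def pca(X, k):
    """Project X (n samples x d features) onto its top-k principal components.

    Returns (projected data, projection matrix, explained-variance ratios).
    """
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)               # subtract the per-feature mean
    cov = np.cov(centered, rowvar=False)        # d x d matrix, 1/(n-1) scaling
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]           # re-sort by descending eigenvalue
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    W = eigvecs[:, :k]                          # projection matrix (d x k)
    ratios = eigvals[:k] / eigvals.sum()        # share of variance per component
    return centered @ W, W, ratios


rng = np.random.default_rng(0)
toy = rng.normal(size=(200, 4))                 # hypothetical data, 4 features
toy[:, 1] = 3.0 * toy[:, 0] + 0.1 * toy[:, 1]   # make two columns correlated
Z, W, ratios = pca(toy, k=2)
print(Z.shape, ratios)
```

The absolute entries of each column of `W` indicate how strongly each original feature contributes to that component, which is the kind of per-feature 'weight' examined later when the least impactful features are identified.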
Multi-Cluster-Based Feature Selection (MCFS): MCFS selects a subset of the original feature-set based on the optimisation over an 'L1−regularized least-squares' problem [37].
A key aspect of the algorithm is its ability to maintain the multi-cluster structure of the data. Determining the correlations between different features is carried out by spectral analysis without any corresponding labels. The spectral analysis usually clusters the data points using the top eigenvectors of graph Laplacian (discussed more in the 'Spectral Clustering' section). MCFS calculates the linear reflection of low-dimension representation of high-dimension features by resolving the L1− regularized regression problem [37] as shown in (2).
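$$\min_{a_k} \; \lVert p_k - X^{T} a_k \rVert^{2} + \beta \lvert a_k \rvert \tag{2}$$

In (2), P is the 'flat' embedding for the data points, where $P = [p_1, p_2, \ldots, p_k]$, X is the data matrix, $a_k$ is the N-dimensional vector of coefficients and $\lvert a_k \rvert$ is its L1-norm, weighted by the regularisation parameter $\beta$ as in the formulation of [37]. Eventually the most useful features are selected as those having the maximum coefficient of sparse representation, and each is assigned a corresponding score called the MCFS score. For every feature m, the corresponding MCFS score, C, is attributed using (3), where $a_{k,m}$ is the m-th member of vector $a_k$.

$$C_m = \max_{k} \lvert a_{k,m} \rvert \tag{3}$$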
All the features are then sorted in descending order on the basis of C. This algorithm is quite useful when the number of features is less than fifty [38].
Laplacian Score For Feature Selection: The algorithm works on the belief that data residing in the same class are often close to each other; thus the importance of a feature can be determined by its power of locality preservation. The algorithm starts off by embedding the data on a nearest neighbor graph G having n nodes. The i-th node represents the element $x_i$. The graph connects $x_i$ with another element or node $x_j$ belonging to the k nearest neighbors of $x_i$. The Weight Matrix, W, of G describes the local structure of the data space and is defined using (4).

$$W_{ij} = \begin{cases} e^{-\lVert x_i - x_j \rVert^{2} / c}, & \text{if } x_i \text{ and } x_j \text{ are connected in } G \\ 0, & \text{otherwise} \end{cases} \tag{4}$$
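c is an appropriately chosen constant, and a graph Laplacian, L, is constructed from W (graph Laplacians are discussed in the 'Spectral Clustering' section). A Laplacian score, LS, is then calculated for each feature using (5) and the features are ranked accordingly [39].

$$LS_r = \frac{\tilde{f}_r^{T} L \tilde{f}_r}{\tilde{f}_r^{T} D \tilde{f}_r}, \qquad \tilde{f}_r = f_r - \frac{f_r^{T} D \mathbf{1}}{\mathbf{1}^{T} D \mathbf{1}} \mathbf{1} \tag{5}$$

In (5), following the formulation of [39], $f_r$ is the vector of values of the r-th feature over all n nodes, D is the diagonal degree matrix with $D_{ii} = \sum_j W_{ij}$, $L = D - W$, and $\mathbf{1}$ is the all-ones vector.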
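The Laplacian Score procedure described above can be sketched with NumPy as follows. The sketch follows the commonly cited formulation of [39] (k-nearest-neighbour graph, heat-kernel weights, degree-weighted mean removal) and is not the code used in this study: the function name `laplacian_scores`, the defaults k = 5 and c = 1.0, the symmetrisation of the neighbour graph and the toy data are assumptions. It builds dense n × n matrices, so it is only suitable for small examples.

```python
import numpy as np


def laplacian_scores(X, k=5, c=1.0):
    """Laplacian Score of every column of X (n samples x d features).

    Assumes X has no duplicate rows, so that each point's nearest
    neighbour in the sorted distance list is the point itself.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    # Squared Euclidean distances between all pairs of rows.
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    # Indices of the k nearest neighbours (column 0 is the point itself).
    nn = np.argsort(sq, axis=1)[:, 1:k + 1]
    adj = np.zeros((n, n), dtype=bool)
    adj[np.repeat(np.arange(n), k), nn.ravel()] = True
    adj = adj | adj.T                       # connect if either is a neighbour
    W = np.where(adj, np.exp(-sq / c), 0.0) # heat-kernel weights on the edges
    d = W.sum(axis=1)                       # node degrees (diagonal of D)
    L = np.diag(d) - W                      # graph Laplacian
    scores = []
    for r in range(X.shape[1]):
        f = X[:, r]
        f_t = f - (f @ d) / d.sum()         # remove the degree-weighted mean
        scores.append((f_t @ L @ f_t) / (f_t @ (d * f_t)))
    return np.array(scores)


# Toy data: columns 0 and 1 form two well separated groups; column 2 is noise.
rng = np.random.default_rng(0)
pos = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(10, 1, (30, 2))])
noise = rng.normal(0, 0.01, (60, 1))
X_toy = np.hstack([pos, noise])
scores = laplacian_scores(X_toy, k=5, c=1.0)
print(scores)
```

In the formulation of [39], a lower score indicates a feature that better preserves the local neighbourhood structure, so in this toy example the two structured columns score well below the noise column.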

1) PROPOSED METHOD FOR FEATURE REDUCTION
As mentioned before, we use PCA first to get the set of least important features spread within the number of principal components that represent the majority of the sample dataset S. In addition, the cardinality of that set, n, is also important: for MCFS and Laplacian Score, we take the n least important features to formulate the sets corresponding to those two algorithms. From Pseudocode 1 it is evident that three sets of least impactful features are identified using the three algorithms, and the set R (containing the most useful features for clustering purposes) is derived such that it holds all the features from the original feature set of 10, excluding those found in common within the three sets P, M and L of least impactful features.

2) RESULT EVALUATION OF FEATURE SELECTION ALGORITHMS
In this section we detail the results obtained after each of the algorithms has been applied to dataset S.

a: PRINCIPAL COMPONENT ANALYSIS (PCA)
Almost universally, in the case of PCA, the first two principal components (PC1 and PC2) represent or explain the majority of the data (over 50%). Because of our ensemble-based approach to feature selection, we designed the procedure to give as much flexibility as possible in using PCA, as this room is required for the functioning of the other two algorithms. Eventually, the least important features will in any case be identified with confidence and left out. Our approach has been to use the first two principal components as long as those can cumulatively explain at least 40% of the data. In case it falls below 40% (in rare instances), we drop PCA altogether and use only MCFS and Laplacian Score to look for consensus on the two least significant features; that is, the value of n will be 2 to begin with. Provided that at least 40% of the data can be represented by PC1 and PC2, those feature(s) whose cumulative 'weight' across PC1 and PC2 falls below a preset threshold value of 15% (or 0.15) are identified as 'least important' (PCA only). This threshold of 15% has been established keeping in mind that there should be a 'degree of freedom', as we may be tempted to choose a value that is too low (for instance < 10%). Such a low value would render the recommendations of MCFS and Laplacian Score rather ineffective: the cardinality of the three sets may become rather tight and, after commonality comparison, the process may leave out of the final set of least important feature(s) some feature(s) that are in fact 'not effective'. Our aim is to eliminate those low-ranked feature(s) which are deemed non-essential across multiple feature selection algorithms, thereby giving a high degree of confidence in R, the set of most useful features. Obviously, for this hypothesis to be acceptable, we will have to evaluate how the clustering algorithms used in this research respond to R.
Once the PCA has been applied, it can be observed from Fig. 4 that the first two principal components (PC1 and PC2) account for nearly 50% of the data. The Biplot (Fig. 3) also suggests that f 4 , f 6 and f 7 have rather a minor variability across PC1 and PC2, and carry insignificant 'Weight'.   Table 2 projects the PCA-produced 'Weight Matrix' for all 10 features. It can clearly be observed that f 4 , f 6 and f 7 indeed have very low cumulative weights (0.01, 0.03 and 0.00). Besides, f 8 also seem to have a cumulative weightage, (PC1 + PC2), of only 0.14, thus failing to push through the threshold value of 0.15. We therefore consider these four features to be included in the set of least impactful features, P, obtained from PCA. Therefore, P = {f 6 , f 4 , f 7 , f 8 } and n = |P| or 4.
Thus for Laplacian Score and MCFS, the last ranked n number of features or four (4) features will be selected as the elements for the sets of least important features respectively. Table 2 lists the features from most important (f 3 ) to the least important (f 5 ) according to the ranking based upon the scores assigned to each feature.

b: LAPLACIAN FEATURE SCORE
As we can see from Table 3, the four (n) lowest-ranking features are f4, f7, f6 and f5. Therefore the set L of least impactful features for Laplacian Feature Score is L = {f4, f7, f6, f5}. Another interesting aspect can be observed by plotting a simple line plot of the Laplacian Scores (Fig. 4): the more impactful features contribute almost equally, as they are tightly grouped in a cluster, while the less important features lie rather distant from it, and their levels of impact are consequently minimal in comparison to the cluster of useful features.

c: MULTI-CLUSTER FEATURE SELECTION (MCFS)
As we can see from Table 4, the four (n) lowest-ranking features are f5, f6, f2 and f4. Therefore the set M of least impactful features for MCFS is M = {f5, f6, f2, f4}. If we consider C to be the set of all ten (10) features, then the set of most useful features, R, is derived as per (6):

$$ R = C \setminus (P \cap L \cap M) = C \setminus \{f_4, f_6\} \tag{6} $$

Thus features f4 and f6 have been identified by all three algorithms as less impactful, and we can confidently state that R contains the smallest feature subset in which every element (feature) carries a significant degree of contributing weight that can aid in revealing useful clusters. The cardinality of R is |R| = 8, so we have achieved a 20% reduction from the original feature vector (C) of cardinality 10, retaining the 80% of features that are most useful. f4 and f6 in the original feature vector stand for 'date of email' and 'Message-ID' respectively. We now have to restructure our original dataset X, leaving these two less important features out.
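The consensus step above can be sketched in a few lines of Python. This is only an illustration: the feature labels `f1` to `f10` and the variable names are assumptions, while the three sets are the ones reported in this section.

```python
# Feature labels f1..f10 stand in for the ten original features (set C).
C = {f"f{i}" for i in range(1, 11)}

# Least important features reported by each algorithm in this section.
P = {"f6", "f4", "f7", "f8"}   # PCA (cumulative weight below 0.15)
L = {"f4", "f7", "f6", "f5"}   # Laplacian Score (four lowest ranked)
M = {"f5", "f6", "f2", "f4"}   # MCFS (four lowest ranked)

# Only features rejected by all three algorithms are dropped.
consensus = P & L & M
R = C - consensus
reduction = 1 - len(R) / len(C)

print(sorted(consensus), len(R), f"{reduction:.0%}")
# prints: ['f4', 'f6'] 8 20%
```

Using the intersection rather than the union is what keeps the reduction conservative: a feature survives unless every algorithm agrees it is unimportant.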

C. RESHAPING THE ORIGINAL DATASET
The bonafide dataset, X, as shown in Fig. 5, has at this stage been transformed into X r , which has the feature set R. So instead of 10 features, it now has the 8 most critical ones. Afterwards, X r was sliced into two separate datasets, X r P1 and X r P2 . X r P1 contains 60% of the data (60,000 rows), inclusive of ham and spam emails (1:2), and X r P2 holds the remaining 40% with the same ratio.
Both have the same feature vector R. We separated the datasets because, once the clustering algorithms have clustered X r P1 and the results have been validated through rigorous measures, it is important to evaluate how consistently these validated clustering results carry over to a different set of data. If any of the algorithms shows significant consistency, we can reach a decision on its performance with a high degree of confidence.

D. APPLYING UNSUPERVISED CLUSTERING ALGORITHMS TO CREATE THE CLUSTERS
This section will briefly discuss the algorithms that have been used for clustering purposes (Fig. 7). As mentioned before, only those algorithms where the number of clusters created can be controlled, have been deployed for the model. In the subsequent sub-sections, the resulting clusters will be investigated through 3D Scatterplot visualisation.

1) CLUSTERING ALGORITHMS USED

a: BIRCH (BALANCED ITERATIVE REDUCING AND CLUSTERING USING HIERARCHIES)
BIRCH is one of the very few unsupervised clustering algorithms that can cluster substantially large datasets with the available (and often limited) resources, such as the 'main memory' of a processing unit. In general, BIRCH incrementally and dynamically processes an un-clustered dataset of multidimensional metric data points in a rather time-efficient manner in one single scan [40]. The clustering quality is often improved thereafter through some additional scans. The algorithm is said to effectively handle data points that are not really part of the underlying patterns (often denoted as 'Noise'). To handle large datasets, BIRCH first calculates a 'triple' entry for the data points, known as the 'Clustering Feature (CF)', and then dynamically builds a tree of CFs (CF-Tree). Given N d-dimensional data points $\vec{X}_j$ (j = 1, ..., N) in a cluster, the CF of that cluster is the triple $CF = (N, \vec{LS}, SS)$, where $\vec{LS} = \sum_{j=1}^{N} \vec{X}_j$ is the linear sum and $SS = \sum_{j=1}^{N} \vec{X}_j^{2}$ is the square sum of the points [40]. A CF may also be composed of multiple other CFs. The CF-Tree is a compact summary of the complete dataset and retains the maximum distribution information of the data. Subsequent incremental clustering is then carried out on this summary representation instead of the original dataset [40].

b: HDBSCAN (HIERARCHICAL DENSITY-BASED SPATIAL CLUSTERING OF APPLICATIONS WITH NOISE)
HDBSCAN is an extension and improved version of the DBSCAN algorithm, in which a density-based clustering technique is preferred over the centroid-based clustering used in K-means. Density-based clustering is an unsupervised learning technique that recognises distinctive clusters in the data, on the assumption that a cluster in a data space is a contiguous region of high point density, separated from other such dense areas by contiguous areas of low point density [41].
HDBSCAN first gets a rough estimate of 'density' and then 'pushes away' the points in low-density areas further, not only from each other but also from regions of high density. The 'Mutual Reachability Distance (MRD)', shown in (7), is used to achieve this purpose.

$$ x_{\mathrm{mrd}\text{-}k}(p, q) = \max\{\mathrm{core}_k(p),\ \mathrm{core}_k(q),\ x(p, q)\} \tag{7} $$

core_k(p), the 'Core Distance', is measured for parameter k for a point p as the distance that must be travelled from p to reach the defined minimum number of points for a cluster, that is, the distance to its k-th nearest neighbour. Fig. 6 shows the concept of Core Distance for k = 4.
So, if a 'large' minimum number of points per cluster is selected, the corresponding core distance will also be larger. x(p, q) is the original metric distance between p and q. Dense points having a low core distance remain the same distance apart from each other, but sparser points are pushed apart so that they are at least their core distance away from any other point. The algorithm then employs a Minimum Spanning Tree [42] to identify the dense regions, builds up a hierarchy of clusters and condenses it as required before extracting the final clusters. HDBSCAN works well even when clusters are arbitrarily shaped and of dissimilar density and size.
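To make the effect of the mutual reachability distance concrete, the following is a minimal NumPy sketch, not the HDBSCAN implementation itself. The function name `mutual_reachability` and the toy data are assumptions, as is the convention that the k-th neighbour excludes the point itself (implementations differ on this).

```python
import numpy as np

def mutual_reachability(X, k):
    """Return (mrd, core): pairwise mutual reachability distances and core distances."""
    # Plain Euclidean distance matrix x(p, q).
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Column 0 of each sorted row is the point itself (distance 0),
    # so column k holds the distance to the k-th nearest neighbour.
    core = np.sort(dist, axis=1)[:, k]
    # max{core_k(p), core_k(q), x(p, q)} for every pair.
    mrd = np.maximum(dist, np.maximum(core[:, None], core[None, :]))
    np.fill_diagonal(mrd, 0.0)
    return mrd, core

# Three points close together and one isolated point.
X = np.array([[0.0], [1.0], [2.0], [10.0]])
mrd, core = mutual_reachability(X, k=2)
print(core)        # core distances of the four points
print(mrd[2, 3])   # the isolated point ends up further away than its raw distance of 8
```

In this toy example the isolated point has a core distance of 9, so its distance to the nearest dense point grows from 8 to 9, which is the 'pushing away' described above.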

c: K-MEANS
K-means is one of the most commonly used centroid-based clustering algorithms; it attempts to cluster similar data points together to find any underlying pattern. K-means delivers the final output through a process called iterative refinement. It tries to minimise the sum of the squared distances between the data points and the cluster's centroid. The 'Centroid' is defined as the arithmetic mean of all the data points that belong to that cluster. The number of groups is denoted by K, and iteratively each data point is assigned to one of these groups based on the identified similarities among the features [43]. The initial number of clusters, K, has to be provided as an input. This can sometimes be a delicate issue: users often end up running the system multiple times with different values of K, and afterwards a comparison is drawn to select the best value. However, various methods are available for getting a reasonably stable approximation of K [43]. K-means most commonly uses the 'Euclidean Distance' to determine the distance between two data points (Z n and Z m ), as shown in (8) [44]:

$$ d_E(Z_n, Z_m) = \sqrt{\sum_{i=1}^{d} (Z_{n,i} - Z_{m,i})^2} \tag{8} $$

One of the key advantages of K-means is that even when the number of features is very high, it can still complete the computation in a reasonable time if the value of K is kept rather small.
Given a set of d-dimensional real vector observations (y 1 , y 2 , . . . , y n ), K-means clustering aims to partition the n observations into k (≤ n) sets S = {S 1 , S 2 , . . . , S k } so as to minimise the variance V, as shown in (9), where µ i denotes the mean of S i :

$$ V = \sum_{i=1}^{k} \sum_{y \in S_i} \lVert y - \mu_i \rVert^2 \tag{9} $$
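As a small check of the objective described above, the sketch below fits scikit-learn's `KMeans` on synthetic data and recomputes the within-cluster sum of squared distances by hand. The synthetic blobs and all parameter values are assumptions chosen for illustration, not the settings used in this study.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic two-cluster data standing in for the real feature matrix.
X, _ = make_blobs(n_samples=300, centers=2, cluster_std=1.0, random_state=0)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Sum over clusters of squared Euclidean distances to the cluster centroid.
V = sum(
    ((X[km.labels_ == i] - km.cluster_centers_[i]) ** 2).sum()
    for i in range(km.n_clusters)
)

# scikit-learn exposes the same quantity as `inertia_`.
print(np.isclose(V, km.inertia_))
```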

d: SPECTRAL
Spectral clustering has gained considerable ground in recent years due to its straightforward implementation and encouraging performance, especially in graph-based clustering, where it quite often outperforms other frequently deployed algorithms such as K-means. Spectral clustering starts by generating a 'Similarity graph' between the N input objects to be clustered. It then defines a feature vector for each of the N objects by computing the first k eigenvectors of the graph's Laplacian matrix, before executing K-means on these features to group the objects into k clusters. The epsilon-neighbourhood (ε-neighbourhood) graph is the most common method of building the 'Similarity graph', which is a non-negative symmetric graph. Each vertex is connected to the vertices falling inside a real-valued circular radius ε (epsilon), which requires tuning to capture the local structure of the data.
The crux of the algorithm is the graph Laplacian matrix [45], L, as demonstrated in (10):

$$ L = D - A \tag{10} $$

where A is the 'Adjacency matrix' of graph G, having A ij ≥ 0, and D is the 'Diagonal matrix' of A (with D ii = Σ j A ij ). A normalised form of the Laplacian matrix, L pq , is often defined as in (11).

e: K-MODES
Unlike K-means, K-modes applies a straightforward measure of matching dissimilarity for categorical data. Additionally, instead of using 'means' for centroid creation, K-modes relies on 'Mode' statistics, and it applies a frequency-based strategy to these modes to limit the clustering cost as much as possible. These differences from K-means account for its ability to handle pure, unconverted categorical data [46]. A dissimilarity measure for Y 1 and Y 2 , two n-dimensional vectors, can be obtained through (12). A higher number of mismatches clearly indicates a lower degree of similarity between Y 1 and Y 2 . This dissimilarity, d, can be expressed as:

$$ d(Y_1, Y_2) = \sum_{j=1}^{n} \delta(y_{1j}, y_{2j}), \qquad \delta(y_{1j}, y_{2j}) = \begin{cases} 0, & y_{1j} = y_{2j} \\ 1, & y_{1j} \neq y_{2j} \end{cases} \tag{12} $$

f: OPTICS (ORDERING POINTS TO IDENTIFY THE CLUSTERING STRUCTURE)
OPTICS is another useful density-based clustering algorithm that closely resembles HDBSCAN in the way it works. OPTICS also works well with varying cluster sizes. Core samples of high density are first identified and subsequently expanded into multiple clusters. To begin with, an ordering of all objects in the dataset is calculated. For each object or point in the dataset, the core distance and an appropriate 'Reachability distance' are stored. The reachability distance, r_d, of an object or point y with respect to another object or point x is the smallest distance from x at which y is directly reachable, provided x is a core object (an object having a dense neighbourhood). Basically, x is a core object or point if at least min_pts points, including itself, are found within its ε-neighbourhood N ε (x). The reachability distance cannot be smaller than the core distance, c_d, of x, as demonstrated in (13) [47]. Epsilon, ε, denotes the maximum distance to consider, while min_pts indicates the minimum number of points needed to form a cluster.

$$ r\_d_{\varepsilon, min\_pts}(y, x) = \begin{cases} \text{undefined}, & |N_{\varepsilon}(x)| < min\_pts \\ \max\big(c\_d_{\varepsilon, min\_pts}(x),\ d(x, y)\big), & \text{otherwise} \end{cases} \tag{13} $$

OPTICS then maintains a list known as OrderSeeds to produce the output ordering. Objects in OrderSeeds are sorted by the reachability distance from their respective closest core objects. OrderSeeds is a linear list of all objects under analysis and represents the density-based clustering structure of the data, from which basic clustering information of a dense area, such as its shape and centroid, can be retrieved.
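The core distance and reachability distance described above can be illustrated with a short NumPy sketch. It is a toy illustration rather than the OPTICS algorithm: the helper names `core_distance` and `reachability_distance`, the use of `None` for 'undefined', and the sample data are all assumptions.

```python
import numpy as np

def core_distance(X, i, eps, min_pts):
    """Distance from X[i] to its min_pts-th neighbour (itself included),
    or None when X[i] has fewer than min_pts points within radius eps."""
    d = np.sort(np.linalg.norm(X - X[i], axis=1))
    if np.count_nonzero(d <= eps) < min_pts:
        return None
    return d[min_pts - 1]

def reachability_distance(X, y, x, eps, min_pts):
    """Reachability distance of point y with respect to point x."""
    c_d = core_distance(X, x, eps, min_pts)
    if c_d is None:          # x is not a core object: undefined
        return None
    return max(c_d, np.linalg.norm(X[x] - X[y]))

X = np.array([[0.0], [0.5], [1.0], [5.0]])
eps, min_pts = 2.0, 3

print(reachability_distance(X, y=1, x=0, eps=eps, min_pts=min_pts))  # bounded below by the core distance
print(reachability_distance(X, y=3, x=0, eps=eps, min_pts=min_pts))  # plain distance dominates
print(reachability_distance(X, y=0, x=3, eps=eps, min_pts=min_pts))  # point 3 is not a core object
```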

2) DETERMINING CLUSTERING TENDENCY OF THE DATASETS
Before applying any clustering algorithm to our datasets X r P1 and X r P2 , an evaluation using the Hopkins Statistic [48] was carried out to confirm the presence of non-random, relevant cluster-like structures within the data, that is, whether our datasets indeed have meaningful clusters to begin with. Table 5 summarises the findings, which indicate high confidence in the presence of relevant clusters, as the probabilities are above 90%. The Hopkins statistic is basically a spatial measure that tests the spatial randomness of a variable as distributed in a space [58].
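A minimal sketch of one common formulation of the Hopkins statistic is given below. It is an assumption about the general technique, not the exact procedure behind Table 5: the function name `hopkins`, the sample size `m`, the bounding-box sampling and the synthetic data are all illustrative. Under this formulation, values near 0.5 suggest random data and values approaching 1 suggest clustered data.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def hopkins(X, m=100, seed=0):
    """Hopkins statistic H = sum(u) / (sum(u) + sum(w))."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    m = min(m, n - 1)
    nn = NearestNeighbors(n_neighbors=2).fit(X)

    # w: distance from m sampled real points to their nearest *other* real point.
    idx = rng.choice(n, size=m, replace=False)
    w = nn.kneighbors(X[idx], n_neighbors=2)[0][:, 1]

    # u: distance from m uniform random points (inside the bounding box of X)
    # to their nearest real point.
    U = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, d))
    u = nn.kneighbors(U, n_neighbors=1)[0][:, 0]

    return u.sum() / (u.sum() + w.sum())

rng = np.random.default_rng(42)
clustered = np.vstack([
    rng.normal(0.0, 0.1, size=(200, 2)),
    rng.normal(5.0, 0.1, size=(200, 2)),
])
uniform = rng.uniform(0.0, 1.0, size=(400, 2))

h_clustered = hopkins(clustered)
h_uniform = hopkins(uniform)
print(h_clustered, h_uniform)
```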

3) CLUSTERS PRODUCED BY THE ALGORITHMS
In a perfect world, an algorithm that delivers 100% accurate results would produce the clusters shown in Fig. 8.
As can be seen in Fig. 8, all the data points from X r P1 form perfect clusters with no detectable overlap. That is, no records (data points) of spam emails have been misclassified as ham (in the 3D perspective of the figure, no data points rise from the bottom plane to the z 1 area marked by the red dotted box), and no ham has been misclassified as spam (no data points drop to the bottom plane in the z 0 region). However, in the real world the outcome will not be as perfect as this, as there are bound to be some misclassifications and a subsequent admixture of data points from both clusters. In the following sub-sections, we visually examine how the algorithms clustered X r P1 . The closer an algorithm gets to the above perfect clustering, the better its performance.

a: CLUSTERS PRODUCED BY BIRCH
Fig. 9(a) and Fig. 9(b) display the two clusters generated by BIRCH from top-down and bottom-up views. The images tell us that the clustering is actually quite poor, as a large number of ham emails have been misclassified as spam, whereas some spam emails have also been misclassified as ham. The overall clustering achieved by BIRCH thus remains far from perfect in this case. The high degree of ham being misclassified is the key issue here, and it clearly caused the quality of the overall clustering to plummet. The red dotted line, along with the yellow signage, in the following scatterplots approximately indicates the area of misclassification.

b: CLUSTERS PRODUCED BY HDBSCAN
Clusters generated by HDBSCAN did not show strong performance either, as can be seen from Fig. 10 (a) and 10 (b). In this case it has been the other way around: a high degree of spam emails has been misclassified as ham, and some ham as spam. As HDBSCAN does not accept the exact number of clusters as a parameter, we have set 'min_cluster_size' to 17% of the input data, leading to a 2-cluster solution.

c: CLUSTERS PRODUCED BY K-MODES
K-modes performs better than BIRCH and HDBSCAN, as shown in Fig. 11 (a) and 11 (b), but considerable overlap can still be seen in both regions of misclassification. Both ham and spam emails have thus been wrongly identified to quite a large extent, which indicates that K-modes' performance is still not up to scratch and keeps the door open for other clustering algorithms.

d: CLUSTERS PRODUCED BY SPECTRAL
Spectral clustering has been visualised in Fig. 12 (a) and Fig. 12 (b). The figures show that Spectral seems to have achieved significantly better clustering than the algorithms we have investigated thus far. The regions of misclassification are mostly empty except for a few strips of misclassified data points. Overall it seems to have performed better than K-modes, and it most certainly outperformed both BIRCH and HDBSCAN.

e: CLUSTERS PRODUCED BY K-MEANS
The performance demonstrated by K-means, as can be seen in Fig. 13 (a) and 13 (b), is comparable to that of Spectral. The clustering structures in the two cases are almost identical. When we proceed to the validation of results, we can quantify the performance of these two algorithms and deduce which one performed better.

f: CLUSTERS PRODUCED BY OPTICS
OPTICS, from Fig. 14 (a) and 14 (b), seems to have surpassed all the other algorithms analysed thus far and, judging from the visualisation, produced the most compact set of clusters. There are noticeably few misclassifications, and overall the cluster structures are closest to the optimum level illustrated in Fig. 8. Again, as there is no direct parameter that takes the number of clusters as input, we had to set 'min_cluster_size' to 23% of the data, 'min_samples' to 50 and 'cluster_method' to 'xi' in order to produce a two-cluster solution.
The visualised output of the clustering algorithms discussed above indicates that not all the algorithms are suitable for clustering emails into ham and spam. K-means, OPTICS and Spectral seem to produce better clusters than BIRCH, K-modes and HDBSCAN. However, we need to quantify the results with proper validation methods to cement a decision with a high degree of confidence.
Python's scikit-learn [49] implementations of the algorithms have been used for this research initiative.
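For illustration, the OPTICS settings quoted above map onto scikit-learn's `OPTICS` estimator roughly as follows. This is a sketch on synthetic data, which is an assumption standing in for X r P1; whether exactly two clusters emerge depends on the data, so the snippet only counts what was found (label -1 marks points treated as noise).

```python
from sklearn.cluster import OPTICS
from sklearn.datasets import make_blobs

# Synthetic stand-in for the real feature matrix.
X, _ = make_blobs(
    n_samples=400, centers=[[-5, -5], [5, 5]], cluster_std=0.8, random_state=0
)

optics = OPTICS(
    min_samples=50,          # minimum neighbourhood size for a core point
    cluster_method="xi",     # extract clusters with the Xi steepness method
    min_cluster_size=0.23,   # a cluster must hold at least 23% of the samples
)
labels = optics.fit_predict(X)

n_clusters = len(set(labels) - {-1})
n_noise = int((labels == -1).sum())
print(n_clusters, n_noise)
```

Passing `min_cluster_size` as a fraction is what allows the cluster count to be steered indirectly: with a threshold above 20%, at most four clusters can exist, and in practice far fewer survive.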

g: CHOICES OF DISTANCE METRICS
The proper selection of distance metrics (signifying how similar or far apart a pair of data points are) is essential to any clustering and critically affects the performance of the algorithms. Oftentimes it depends on the type of data in the dataset and on the problem domain.
Due to the numeric nature of our dataset, the Euclidean distance has been used for all the algorithms, except for Spectral clustering, where the 'rbf' (Radial Basis Function [50]) kernel has been preferred (although 'rbf' internally uses the Euclidean distance as well), and K-modes, where the dissimilarity is calculated differently from the Euclidean distance (the technique is often known as the 'Hamming Distance' [51]). Numeric conversion of categorical data with collision avoidance [52], [53], as in our case, needs to be handled with appropriate caution. Because 'ranking' of features is not one of our purposes, and we are only looking to group data points into relevant clusters based on the calculated distance between them, the Euclidean distance can be applied in our case.

E. CLUSTER VALIDATION
To measure the 'Goodness' of the produced clusters and to get an objective insight into the clustering algorithms' merit, an extensive range of validation methods has been applied, as shown in Fig. 15. In those cases where the optimum number of clusters is unknown, validation can also provide a credible estimate of it. Validation can be done primarily in two ways:
Internally: Such processes evaluate the connectedness (how well a pair of data points within the same cluster are connected to each other compared with the immediate data points placed outside the cluster) and the compactness (how close the data points placed inside the same cluster are to each other) [54]. Internal measures do not require any prior cluster labelling or ground truths. Acceptable clusters have minimal 'Connectedness' and 'Compactness'.
Externally: These validation techniques gauge the degree to which cluster labels match class labels supplied externally [55]. As our datasets are custom-built, we have the luxury of using external measures; note that these class labels have not been used in any of the processes discussed in the previous sections. We will also look at the 'True Rate of Detection' (the 'Recall' measure) for each of the clusters.
A number of validation methods as outlined in Table 6 have been applied.

1) INTERNAL VALIDATION
In this section, we look at how the clusters have been validated using various internal metrics.

a: DAVIES-BOULDIN INDEX
The metric works on the basis of the ratio of within-cluster distances to between-cluster distances [56]. The smaller the value, the better the clustering. Note that we have used the reverse of the Davies-Bouldin Index, (1 - Davies-Bouldin Index). This reverses the direction but makes it consistent with the other indices used in this research, without affecting the overall outcome. The Davies-Bouldin Index (DBI) can be calculated for any value of n_cluster (n c ) using (14) [63]:

$$ DBI(n_c) = \frac{1}{n_c} \sum_{i=1}^{n_c} \max_{j \neq i} \left\{ \frac{\frac{1}{|c_i|} \sum_{x \in c_i} d(x, x_i) + \frac{1}{|c_j|} \sum_{x \in c_j} d(x, x_j)}{d(x_i, x_j)} \right\} \tag{14} $$

where d is the Euclidean distance between the points, c j is the cluster j having x j as its centroid, and |c j | is the number of points in c j .
From Fig. 16 we can observe that the metric considers Spectral, K-means and OPTICS to be 'almost' equal performers, while K-modes is considered the poorest, followed by HDBSCAN and BIRCH. The conclusion differs slightly from the knowledge we have gained through the scatterplot visualisation of the clusters in previous sections, where K-modes had been thought to have edged out both BIRCH and HDBSCAN. Also, as indicated by the scatterplots, OPTICS had, on a visual scale at least, achieved the optimum clustering, which we can see here as well. However, we will investigate further through other metrics to determine the degree of correlation with our visual clues.
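The reversed index described above can be obtained from scikit-learn's `davies_bouldin_score` as in the following sketch. The synthetic data and the K-means labelling are assumptions used only to have something to score.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score

# Two well-separated synthetic clusters.
X, _ = make_blobs(
    n_samples=300, centers=[[-4, 0], [4, 0]], cluster_std=1.0, random_state=0
)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

dbi = davies_bouldin_score(X, labels)   # lower is better
reversed_dbi = 1 - dbi                  # higher is better, like the other indices

print(round(dbi, 3), round(reversed_dbi, 3))
```

Note that the raw index is not bounded above by 1, so the reversed value can become negative for poorly separated clusters.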

b: CALINSKI-HARABASZ INDEX
The Calinski-Harabasz Index is a ratio-type index that evaluates cluster validity by comparing the average between-cluster and within-cluster sums of squares [57]. A higher value indicates a better clustering. The index, CH k , is defined as in (15) [64], where V b is the overall between-cluster variance, V w is the overall within-cluster variance, N is the number of observations and k denotes the total number of clusters.

$$ CH_k = \frac{V_b / (k - 1)}{V_w / (N - k)} \tag{15} $$
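The ratio can be checked numerically with a short sketch that computes the between-cluster and within-cluster sums of squares by hand and compares the result with scikit-learn's `calinski_harabasz_score`. The synthetic data and the use of the generated blob labels as cluster labels are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score

# The generated blob labels are used directly as the cluster assignment.
X, labels = make_blobs(n_samples=300, centers=2, random_state=0)
N, k = len(X), 2
overall_mean = X.mean(axis=0)

# Between-cluster dispersion: size-weighted squared distance of each centroid
# from the overall mean.
V_b = sum(
    (labels == i).sum() * ((X[labels == i].mean(axis=0) - overall_mean) ** 2).sum()
    for i in range(k)
)
# Within-cluster dispersion: squared distance of each point from its centroid.
V_w = sum(
    ((X[labels == i] - X[labels == i].mean(axis=0)) ** 2).sum()
    for i in range(k)
)

ch = (V_b / (k - 1)) / (V_w / (N - k))
print(np.isclose(ch, calinski_harabasz_score(X, labels)))
```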
The Calinski-Harabasz Index has no upper bound, so the values returned can be very large; for instance, the measure for Spectral's performance has been quantified as 111546.382. We have therefore confined the results to the range between 0 and 1, which may slightly diminish the differences among the algorithms but still shows the general trend. In addition, we were able to gather comparative insights in relation to the other indices.
In this research, a slight variation of the Sigmoid function, as shown in (16), has been applied to the outputs of the Calinski-Harabasz Index to 'squash' them into the range 0 to 1, so that for each output x we get a corresponding value v after squashing such that {v: v > 0 and v < 1}. n is the length of the integer part of the highest value returned by the Calinski-Harabasz Index. Fig. 17 presents the evaluated results, which somewhat match what we have assumed from the scatterplot cluster visualisation.
OPTICS, though, seems to have performed slightly worse than K-means and Spectral, while all other algorithms showed unsatisfactory clustering, with BIRCH being the poorest. Additionally, the real difference between the algorithms that have shown promising results and the rest is actually quite substantial and sharper than what appears in the figure, but the general trend remains the same.

c: SILHOUETTE COEFFICIENT SCORE
The Silhouette Coefficient is one of the most widely used internal cluster validation techniques. The Silhouette Coefficient score, c, is derived for each of the samples using the mean within-cluster (intra-cluster) distance p and the mean nearest-cluster distance q, generally using (17) [58]:

$$ c = \frac{q - p}{\max(p, q)} \tag{17} $$

where q is the distance between a sample and the nearest cluster that the sample is not a part of. The metric is primarily an intuitive graphical tool that aids the user in visually assessing cluster quality. The Silhouette Coefficient scores for each of the algorithms have been charted in Fig. 18. The chart shows the closest resemblance to the assumptions we formed from the previously discussed cluster scatterplots. OPTICS clearly performed best, with K-means and Spectral not far behind, whereas the remaining three produced disproportionately unsatisfactory results.
In Fig. 17 we can see Silhouette Plots for each of the algorithms, derived from a subsample of X r P1 (the ratio of spam to ham is kept the same in the subsample). Both clusters are shown in each of the plots. Though there are some outliers, Spectral and OPTICS generally fared better (approaching +1), with K-means close behind. The plots give us a quick visual perspective of the clustering quality based on internally calculated Silhouette scores. These scores will not exactly match the scores of Fig. 19, as the observations used are a limited subsample, but we can still get a general idea and gauge how closely they correspond to our findings so far. The red segment of each plot represents the cluster of spam emails, while the teal one represents the cluster of ham.
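The per-sample score described above can be reproduced by hand for a single sample and compared with scikit-learn's `silhouette_samples`. The sketch assumes synthetic two-cluster data, so the 'nearest other cluster' is simply the one other cluster.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

X, labels = make_blobs(n_samples=200, centers=2, random_state=0)

i = 0                                    # sample to inspect
same = labels == labels[i]
dists = np.linalg.norm(X - X[i], axis=1)

# p: mean distance to the other members of the sample's own cluster.
p = dists[same].sum() / (same.sum() - 1)
# q: mean distance to the members of the nearest other cluster.
q = dists[~same].mean()

c = (q - p) / max(p, q)

per_sample = silhouette_samples(X, labels)
print(np.isclose(c, per_sample[i]))
print(np.isclose(per_sample.mean(), silhouette_score(X, labels)))
```

The overall score reported for an algorithm is the mean of these per-sample values, while the Silhouette Plots show their full distribution per cluster.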

d: SUMMARISING THE INTERNAL VALIDATION OUTCOMES
Having gone through a number of internal validation techniques, Table 7 presents a summarised view of the outcomes, summing up the positions of each of the algorithms across the validation charts. It is quite clear that OPTICS and Spectral showed commendable performance, with K-means not far behind, whereas the other three algorithms did not produce much of a positive clustering outcome. The scatterplots of section V.D.3 also depicted a similar pattern. However, to get a complete and comprehensive picture, we will now carry out a number of external validations.

2) EXTERNAL VALIDATION
This section provides a detailed inspection of the quality of the clustering using various external metrics.

a: ADJUSTED RAND INDEX (ARI)
The Rand Index (RI) works out a similarity measure between two clusterings by taking into account all pairs of provided samples and counting the pairs that are assigned to the same or to different clusters in both the predicted and the true clusterings [59]. The raw RI score is then 'adjusted for chance' into the ARI using (18).

$$ ARI = \frac{RI - E[RI]}{\max(RI) - E[RI]} \tag{18} $$

Scores closer to 1 signify better clustering. Fig. 20 graphically relays the results after validation through ARI.

b: ADJUSTED MUTUAL INFORMATION (AMI)
The Mutual Information (MI) quantifies the degree of information the two clusters in question have in common, and in information theory it is often referred to as a 'Correlation Measure'. The MI score is then 'adjusted for chance' to get the AMI [60]. The AMI of two clusters S and H is determined using (19), where T is the Entropy.

$$ AMI(S, H) = \frac{MI(S, H) - E[MI(S, H)]}{\mathrm{avg}\big(T(S), T(H)\big) - E[MI(S, H)]} \tag{19} $$

Scores closer to 1 signify better clustering. Fig. 21 graphically relays the results after validation through AMI.

c: V-MEASURE
V-measure, or Validity measure, of a cluster is a metric developed using conditional entropy analysis. Entropy measures the degree of disorder within a cluster. V-measure takes the harmonic mean of two important characteristics of a clustering: Homogeneity, a measure of whether a cluster holds only members of a single class, and Completeness, whether all members of a given class are allocated to the same cluster [61]. V-measure, v, is given in (20), where h is the homogeneity and c is the completeness. The default value of β is 1, signifying equal weighting of homogeneity and completeness.

$$ v = \frac{(1 + \beta) \cdot h \cdot c}{\beta \cdot h + c} \tag{20} $$

Scores closer to 1 indicate better clustering. Fig. 22 charts the validation results obtained through V-measure.

d: PURITY
Purity is a simple and transparent external validation measure that is often regarded as the 'Cluster Accuracy'. Purity is the ratio of the total number of data points belonging to the dominant class in a cluster to that of its size. Scores closer to 1 suggest better clustering [62]. Fig. 23 shows the measures of purity for each of the algorithms, closely resembling the other measures.
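The four external measures above can be computed with scikit-learn as in the following sketch. The `purity` helper is an assumption (scikit-learn has no built-in purity function); it aggregates over all clusters by summing the dominant-class counts and dividing by the total number of points. The tiny label vectors are made up for illustration.

```python
import numpy as np
from sklearn.metrics import (
    adjusted_mutual_info_score,
    adjusted_rand_score,
    v_measure_score,
)
from sklearn.metrics.cluster import contingency_matrix

def purity(y_true, y_pred):
    """Sum of dominant-class counts per cluster divided by the number of points."""
    cm = contingency_matrix(y_true, y_pred)   # rows: classes, columns: clusters
    return cm.max(axis=0).sum() / cm.sum()

# Hypothetical ground truth (0 = ham, 1 = spam) and cluster assignments.
# Cluster ids are arbitrary: cluster 1 is all ham, cluster 0 is mostly spam.
y_true = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
y_pred = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])

print("Purity:", purity(y_true, y_pred))
print("ARI:", adjusted_rand_score(y_true, y_pred))
print("AMI:", adjusted_mutual_info_score(y_true, y_pred))
print("V-measure:", v_measure_score(y_true, y_pred, beta=1.0))
```

All four measures are invariant to how the cluster ids are numbered, which is why no mapping from clusters to classes is needed at this stage.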

e: SUMMARISING THE EXTERNAL VALIDATION OUTCOMES
Having gone through a number of external validation techniques, Table 8 presents a summarised view of the outcomes, summing up the positions of each of the algorithms across the validation charts. The results obtained are consistent with the results from the internal validations: OPTICS still emerges as the best performer, with K-means and Spectral very close behind in second spot. The remaining three algorithms were far behind. BIRCH and HDBSCAN were amongst the weakest performers, whereas K-modes was slightly better, although not satisfactory.
In the next section, the cumulative positions of each of the algorithms across all seven validation tests (internal and external) will be graphically portrayed.

VI. DETERMINING TOP PERFORMING ALGORITHMS
The figure below (Fig. 24) charts the cumulative positions of each of the clustering algorithms, starting from the Davies-Bouldin Index and ending with the Purity measure at the 7th spot. Fig. 24 clearly shows that OPTICS had a clear and uncontested clustering 'goodness' throughout, while K-means and Spectral were relatively close for the most part, though Spectral seems to edge ahead with slightly better performance. The remaining three, as is now confirmed, clearly failed to display any commendable performance and are far apart from the top three, though K-modes was occasionally moderate in terms of clustering quality.
Thus, in light of all the validations and clustering scatterplots, we can say that Spectral and K-means have been quite good at producing high-quality clusters of ham and spam emails, while OPTICS delivered clusters that are closest to the optimum quality, with, on average, 3.5% better performance than that of Spectral and K-means (x), calculated using (21).
OPTICS has demonstrated a Purity of 93.1%, while K-means and Spectral scored 92.9% and 92.8% respectively. Such high Purity, along with the other validation measures, clearly indicates that the quality of the ham and spam email clusters produced by these three algorithms, based on the header and domain features of emails, is quite high.

TRUE DETECTION RATE (THE RECALL MEASURE) OF SPAM AND HAM IN EACH CLUSTER
The above external measures indicate the performance of each algorithm as a whole. We will now shed light on another form of external validation, narrowing down to a more granular scale to scrutinise how well each of the algorithms can truly differentiate ham from spam emails by finding the 'True' detection rate (TDR) of ham and spam emails, as indicated in V(b) of Fig. 15 (the 'Recall' measure [65], calculated separately for each cluster). It is quite clear from Table 9 that the optimum algorithms (K-means, Spectral and OPTICS) have high and balanced true detection rates for both ham and spam emails, while the low-performing algorithms may show strength in one specific type of cluster but not in both; the results for these algorithms may be heavily skewed towards either of the clusters. The distribution of the detection rates can be observed from the Box Plot of Fig. 25. BIRCH and HDBSCAN have not been considered in this Box Plot due to the heavy skewness of their detection rates.
Considerable gap between the two median points. This gap should be as minimum as possible for an overall balanced outcome. Additionally, the height of the boxes should be somewhat similar and shorter around the high nineties to account for the balance between ham and spam emails as well as for the high degree of True Detection Rates. The plot analyzes the performance of the algorithms as a group.

VII. EVALUATION OF OUTCOME OF A NEW SET OF DATA (X r P2 )
It is important to gauge the performance of these six algorithms on another new set of data as graphically portrayed in Fig. 26, so that the findings above can be confirmed with a considerably high degree of confidence. The entire process of clustering the raw dataset X r P2 into spam ham and spam emails and validating the results have yielded the Heatmap shown in Fig. 28. The Heatmap itself clearly visualises the clustering quality for Spectral, OPTICS and K-means has been above average under majority of the validation schemes, even in this new dataset, while the other three algorithms have not managed to do well enough, as evidence by the reddish blocks. The finding positively corre- lates with the outcome of our earlier investigation. K-modes performance, just as before, is not as severe as BIRCH and HDBSCAN, but certainly leaves a lot to desire.
The positions that the algorithms have obtained in each of the validation tests on the clusters for the new dataset are also quite close to those obtained earlier. Fig. 27 presents the cumulative position of each of these algorithms, showing a trend almost identical to that of Fig. 24. It is evident that multiple validation measures need to be executed on the obtained clusters to get a complete picture.
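One plausible way to accumulate positions across several validation measures is to rank the algorithms under each measure and sum the ranks. The sketch below assumes this rank-sum scheme; the function name, the measure names and the scores are hypothetical placeholders, and the exact aggregation behind Fig. 24 and Fig. 27 may differ.

```python
def cumulative_positions(scores, higher_is_better):
    """Sum each algorithm's position (1 = best) over all validation measures."""
    totals = {}
    for measure, per_algorithm in scores.items():
        # Sort best-first: descending if a higher score is better, else ascending.
        ordered = sorted(
            per_algorithm, key=per_algorithm.get, reverse=higher_is_better[measure]
        )
        for position, algorithm in enumerate(ordered, start=1):
            totals[algorithm] = totals.get(algorithm, 0) + position
    return totals


# Hypothetical scores for three placeholder algorithms under two measures.
scores = {
    "quality_index": {"algo_a": 0.6, "algo_b": 0.4, "algo_c": 0.1},
    "error_index": {"algo_a": 0.5, "algo_b": 0.9, "algo_c": 1.4},
}
higher_is_better = {"quality_index": True, "error_index": False}

totals = cumulative_positions(scores, higher_is_better)
print(totals)  # {'algo_a': 2, 'algo_b': 4, 'algo_c': 6}
```

Under this scheme a lower cumulative total means a consistently better position, which is why a single measure is not enough to separate the algorithms reliably.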
As stated before, Fig. 27 shows the cumulative trends for each of the six algorithms. The figure clearly highlights that OPTICS, Spectral and K-means consistently demonstrated strong performance. Spectral seems to have performed slightly better than K-means, just as with the other dataset. More importantly, the general trend here closely resembles our findings in the previous instance. Table 10 shows the True Detection Rates for the dataset X r P2. We can observe similar patterns here as well, but the optimum three algorithms have this time demonstrated an even better balance between the detection rates of the two classes of emails.
The plot (Fig. 28) shows that, in terms of percentages, the TDR of spam is concentrated in the high nineties, while the TDR of ham is around the mid-eighties (the 'Median' component of the boxes).
Upon deeper scrutiny of the distribution patterns of detection rates through Fig. 29, we can discern that the relative difference in detection rates for ham and spam emails is much narrower in this instance than previously, signifying a reasonable balance. Moreover, the plot suggests that overall performance has in fact been better than what we observed for the dataset X r P1 in Fig. 25. BIRCH and HDBSCAN have not been considered in this case due to their considerably poor and skewed outcomes. Besides, Fig. 30 shows that the Balanced Accuracy ([TDR of ham + TDR of spam emails] / 2) obtained by the algorithms on the X r P2 dataset follows the same general trend obtained on X r P1. The average balanced accuracy for BIRCH, HDBSCAN and K-modes across both datasets was found to be around 64.5%, while for OPTICS, Spectral and K-means it is around 94.91%.
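The Balanced Accuracy formula quoted above, and its averaging over a group of algorithms, can be written out as a short sketch. The function name and the TDR values below are made-up illustrations under that formula; they are not the figures from Table 9 or Table 10.

```python
from statistics import mean


def balanced_accuracy(tdr_ham, tdr_spam):
    """Balanced Accuracy = (TDR of ham + TDR of spam) / 2, in the same unit as the inputs."""
    return (tdr_ham + tdr_spam) / 2


# Hypothetical (TDR of ham, TDR of spam) pairs in percent for two placeholder algorithms.
per_algorithm = {"algo_a": (90.0, 98.0), "algo_b": (95.0, 97.0)}

balanced = {name: balanced_accuracy(*tdrs) for name, tdrs in per_algorithm.items()}
group_mean = mean(balanced.values())

print(balanced)    # {'algo_a': 94.0, 'algo_b': 96.0}
print(group_mean)  # 95.0
```

Because the two detection rates are weighted equally, an algorithm that detects one class almost perfectly but misses the other is penalised, which is the behaviour the measure is meant to expose.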
Therefore, in light of this detailed investigation of the performance of six key unsupervised algorithms in clustering ham and spam emails into their respective categories, it is evident that OPTICS, Spectral and K-means demonstrate better outcomes than the other algorithms examined, with a quite substantial margin of difference. Appendix B has the detailed and complete flow diagram of the whole process.
We have compared our work, as shown in Table 11, with a number of reasonably recent and related initiatives (2014 onward) that use some form of unsupervised learning to differentiate ham from spam emails. However, these studies have limitations that scammers can exploit, and they are not always suitable for implementation in actual business settings. Most of these studies focus on only part of the email framework and do not evaluate a number of important features of both the header and the content of the emails used for analysis; some of the datasets used in these studies are also quite limited.

VIII. CONCLUSION AND FUTURE WORKS
This paper detailed the first part of a comprehensive framework, based completely on unsupervised methodology, for unearthing the behavioural patterns that differentiate ham from spam emails through clustering.
Our attempt started with the creation of a raw database of nearly half a million records of ham and spam emails from multiple email collections; this pivotal work is the basis not only for this research but also for the forthcoming propositions. From this database we developed a pre-processed dataset of 100,000 records comprising both ham and spam emails, containing critical header and domain information (except for the 'subject' field). This dataset was then split into two parts (X r P1 and X r P2), containing 60% and 40% of the data respectively and maintaining the same ratio of spam to ham emails. A novel feature reduction method was applied to the complete dataset before partitioning it, to keep the most impactful header and domain features for clustering purposes. This feature reduction method is an ensemble of three distinct unsupervised feature selection algorithms, namely PCA, MCFS and Laplacian Feature Selection. The method achieved a 20% reduction of the preprocessed dataset with significantly high confidence.
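A 60/40 partition that preserves the spam-to-ham ratio can be sketched as a stratified split. The code below is a minimal illustration using only the standard library; the function name, the seed and the 70/30 class mix in the example are assumptions and do not reflect the actual composition of the dataset.

```python
import random


def stratified_split(labels, first_fraction=0.6, seed=0):
    """Return two index lists in which every class keeps roughly the same share."""
    rng = random.Random(seed)

    # Group record indices by class label.
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    first, second = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Cut each class separately so both parts keep the class ratio.
        cut = round(len(indices) * first_fraction)
        first.extend(indices[:cut])
        second.extend(indices[cut:])
    return first, second


# Hypothetical class mix: 70 spam and 30 ham records.
labels = ["spam"] * 70 + ["ham"] * 30
part1, part2 = stratified_split(labels)

print(len(part1), len(part2))                    # 60 40
print(sum(labels[i] == "spam" for i in part1))   # 42
```

Splitting per class, rather than shuffling the whole dataset once, is what guarantees that both partitions mirror the original class ratio instead of merely approximating it by chance.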
Afterwards, a set of algorithms (OPTICS, Spectral, K-means, HDBSCAN, BIRCH and K-modes) was used to cluster the two datasets in two separate runs. In both cases, after a thorough validation process, it was found that OPTICS, Spectral and K-means showed commendable performance, while the other three were not optimum. Such a study on identifying ham and spam emails using 'only' unsupervised methods, acting solely on email header and domain features (except for the 'subject' field), has been a completely novel undertaking. In our future endeavor, we intend to propose the second segment of the framework, where the algorithms used in this study will be applied to the email body and subject field for clustering purposes. Although the Purity of the clusters produced by the three best performing algorithms in this study is extremely good, some degree of misclassification remains. In future propositions of the framework, we aim to reduce this misclassification rate further, and also to implement a third category aside from 'spam' and 'ham', which will gather 'weakly defined' 'spam' and 'ham' into a cluster of unspecific data points or emails. In this way the emails identified as spam and ham will be classified with more confidence, while users will have the chance to act on this third cluster independently of the system, resulting in a more consistent outcome tailored to users' needs, thereby improving user satisfaction and achieving a satisfactory balance between all the clusters (minimum skewness). The end results of these future research attempts, in combination with the knowledge gained from this study, will be the key to developing the intended hybridised unsupervised anti-spam framework.
Additionally, we plan to make our entire database of half a million records publicly available for further research by others when we complete the entire research project through subsequent analysis, experimentation and publication after this article. We have already made the preprocessed dataset of 100,000 records, used for Header analysis in this research, publicly downloadable from github.com/asif5566/dataextract for inspection purposes. The future dataset developed for content and subject analysis will also be made available in due course. We believe such a large, dense and ready-made database of ham and spam emails, containing almost all the relevant fields and content of an entire email, will be a significantly useful contribution for future research undertakings.

APPENDIX A
A. MORE ON DATASET CONSTRUCTION AND PREPROCESSING
We have used Python 3.6 and several related libraries for extracting all possible header fields from the text corpora, as well as for the required preprocessing afterwards. WHOIS records and domain record warehouses have also been consulted for general domain information. A number of mainstream IP blacklisting databases have been checked for the status of the source IP. The seminal database of over half a million records has 14 preprocessed header and domain features: from address, date of mail, source IP, whether the source IP is blacklisted (Boolean in nature), originating domain, internet domain of the originating country, registrar, age of domain, reply address, return path, message ID, type of the email content, arrival time and total hop count. 'Total hop count' is the total number of 'Received' fields in the header. This database was then used to produce the dataset used in this study. For our subsequent research initiatives, we will be adding the email subject and content to this database. Table 12 lists all 14 features and the corresponding datatypes and lengths. Not all 14 features have been used, for the reasons stated earlier.
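As an illustration of how some of these header features could be pulled from a raw message, the sketch below uses Python's standard `email` package. It is an assumed approach, not the extraction pipeline used for the database: the function name, the dictionary keys and the sample message (with reserved example domains and addresses) are hypothetical, and features that require external lookups, such as blacklist status or domain age, are left out.

```python
from email import message_from_string
from email.utils import parseaddr

# Hypothetical sample message using reserved example domains and IP addresses.
RAW_EMAIL = """\
Received: from mx2.example.org (mx2.example.org [192.0.2.10])
\tby mail.example.com; Mon, 6 Jan 2020 10:00:05 +0000
Received: from sender.example.net (sender.example.net [198.51.100.7])
\tby mx2.example.org; Mon, 6 Jan 2020 10:00:01 +0000
From: Alice <alice@example.net>
Reply-To: alice@example.net
Return-Path: <alice@example.net>
Message-ID: <abc123@example.net>
Date: Mon, 6 Jan 2020 10:00:00 +0000
Content-Type: text/plain; charset="utf-8"

body text
"""


def extract_header_features(raw_email):
    """Collect a few header-derived features from one raw email."""
    msg = message_from_string(raw_email)
    from_address = parseaddr(msg.get("From", ""))[1]
    return {
        "from_address": from_address,
        # Domain part of the sender address (text after the last '@').
        "originating_domain": from_address.rpartition("@")[2],
        "reply_address": parseaddr(msg.get("Reply-To", ""))[1],
        "return_path": parseaddr(msg.get("Return-Path", ""))[1],
        "message_id": msg.get("Message-ID", ""),
        "date": msg.get("Date", ""),
        "content_type": msg.get_content_type(),
        # Total hop count: one 'Received' field is added per relay.
        "hop_count": len(msg.get_all("Received", [])),
    }


features = extract_header_features(RAW_EMAIL)
print(features["hop_count"], features["originating_domain"])  # 2 example.net
```

Counting `Received` fields works as a hop count because each mail server that relays a message prepends its own `Received` line to the header.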

B. COMPARISON TO SUPERVISED MODELS
Support Vector Machines (SVM) and Naive Bayes (NB) are the two most common supervised algorithms with which a number of anti-spam models have been developed over time, using many common publicly available datasets. To provide a reasonable comparison against such common supervised methods, we trained two supervised models using SVM and NB, which showed a 'Test Accuracy' of 97.44% and 94.57% respectively with a 60-40 split. The SVM model achieved somewhat better accuracy than our unsupervised counterpart, while NB performed at a similar level. With further research down the line, there is a significant possibility that the unsupervised model will become considerably more efficient.
ASIF KARIM lives in the Port City of Darwin. He is currently a Ph.D. Researcher with Charles Darwin University, Australia. He is also working towards the development of a robust and advanced e-mail filtering system primarily using machine learning algorithms. He has considerable industry experience in IT, primarily in the field of software engineering. His research interests include machine intelligence and cryptographic communication.

SAMI AZAM (Member, IEEE) is a Leading Researcher and a Lecturer with the College of Engineering, IT and Environment, Charles Darwin University, Australia. He has a number of publications in peer-reviewed journals and international conference proceedings. His research interests include computer vision, signal processing, artificial intelligence, and biomedical engineering.
BHARANIDHARAN SHANMUGAM (Member, IEEE) is currently a Research Intensive Lecturer with the College of Engineering, IT and Environment, Charles Darwin University, Australia. He has a large number of publications in several different journals and conference proceedings. His research interest includes the field of cybersecurity.
KRISHNAN KANNOORPATTI is currently a Research Active Associate Professor with the College of Engineering, IT and Environment, Charles Darwin University, Australia. In addition to being a Stellar Academic and Innovative Researcher, he also has extensive experience working with government bodies in setting up data privacy policies at the national and state levels.