Non-Parametric Clustering Using Deep Neural Networks

In this paper, a novel algorithm for non-parametric image clustering is proposed. Non-parametric clustering methods operate by considering the number of clusters unknown, as opposed to parametric clustering, where the number of clusters is known a priori. In the present work, a deep neural network is trained to decide whether an arbitrarily sized group of elements can be considered a single cluster or consists of more than one cluster. Using this trained neural network as a clustering criterion, an iterative algorithm is built that is able to cluster any given dataset. Evaluation on several public datasets shows that the proposed method is either on par with or outperforms state-of-the-art methods, even when compared to parametric image clustering methods. The proposed method is additionally able to correctly cluster input samples from a completely different dataset than the one it has been trained on, as well as data coming from different modalities. Results on cross-dataset clustering show evidence of the generalization potential of the proposed method.


I. INTRODUCTION
One of the fundamental challenges in computer science is the task of grouping data into categories in an unsupervised manner. An abundance of clustering methods and algorithms has been proposed thus far in different scientific fields, such as mathematics, statistics and computer science, exploiting traditional analytical methodologies, machine learning and various techniques based on neural networks. Data clustering is used in a vast number of applications, ranging from text mining, video analysis and medical imaging to social science and the humanities. The ability to group similar data and distinguish dissimilar ones is essential in broadening and expanding the clustering research field and its associated applications.
In particular, extracting underlying connections between high-dimensional data has been tackled in recent years with a plethora of approaches that can be highly distinctive. Subspace clustering algorithms such as [1], [2] and [3] try to extract clusters from multiple and possibly overlapping high-dimensional subspaces. In [4], an algorithm based on sparse representations of the data is
presented. This work states that a sparse representation of a data point is essentially a linear or affine combination of data points that belong to the same subspace, a property defined as self-expressiveness, which is therefore used to group data together. In Deep Subspace Clustering networks (DSC) [5], the authors presented the idea of blending traditional clustering techniques with modern machine learning ones. They introduced a differentiable, non-linear layer that mimics the self-expressiveness property by learning pairwise affinities between data through standard back-propagation. The idea of employing machine learning mechanisms in support of known clustering algorithms is, though, not new; Song et al. [6] confronted the high-dimensionality challenge by using autoencoders rather than typical dimensionality reduction methods such as PCA [7]. Moreover, instead of applying a standard clustering algorithm such as k-means on the encoded data, they created a new objective function able to fuse loss properties of both the autoencoder and the k-means input and output. In [8] and [9], k-means is combined with neural networks in order to segment medical images and detect brain and kidney abnormalities, respectively. Although methods based on established clustering algorithms excel in ease of use, data adaptation and scalability, the requirement to specify the number of clusters is a major restriction on using them in real-life applications. Tian et al. [10] proposed a method based on [4], where, after acquiring a weight matrix through a kernel method, they used a modified spectral clustering algorithm based on autoencoders. Other approaches have attempted to cluster multimodal data. Abavisani and Patel [11] examined different multimodal fusion techniques and proposed a method for fusing affinities across data modalities based on DSC. An additional clustering approach based on popular clustering algorithms is Robust Continuous Clustering (RCC) [12], which achieves clustering and dimensionality reduction jointly by optimizing a continuous objective function that uses mutual k-nearest neighbors (m-kNN) information [13].
Density-based clustering algorithms such as DBSCAN [14] have also been proposed in the literature. The main advantage of DBSCAN is that it does not require any parameterization regarding the expected number of clusters. DBSCAN is robust to outliers; however, it is rather ineffective when applied to high-dimensional data, especially when the density of each data group is unknown. Other density-based clustering works have been presented, such as [15], which is, however, parametric with respect to the neighborhood size and specialized in handling datasets with various data distribution patterns. Spectral clustering methods have also been proposed, as in [16], where pairwise constraints between data points are responsible for generating dynamically adaptive neighborhoods of data points, while preserving a low algorithmic complexity. Furthermore, the affinity propagation algorithm [17] and its extensions [18]-[21] are methods that do not make use of the number of clusters as input (non-parametric). Affinity propagation methods initially consider all items as potential centers and then proceed by letting the initial centers exchange messages carrying information on how they should merge. These methods are robust to outliers; however, their greedy strategy results in a high computational complexity of O(F²G), where F is the total number of items and G the number of algorithm iterations [22]. Finally, an additional non-parametric method is proposed in [23], in which a general fuzzy min-max (GFMM) neural network is employed in order to fuse classification and clustering in a simple yet powerful learning process.
The current work proposes a novel clustering method able to cluster data coming from various classification databases without prior knowledge of the exact number of clusters (non-parametric). To achieve this, a deep neural network is trained to identify whether the contents of an arbitrarily sized group of data belong to the same cluster or not. The trained neural network is then employed as a clustering criterion by an iterative algorithm in order to cluster any given dataset. Furthermore, the present work is also tested in settings where the data to be clustered have never been encountered before (cross-dataset clustering). The proposed method is able to correctly cluster input samples from a completely different dataset than the one it has been trained on, as well as data coming from different modalities.
The remainder of this paper is organized as follows: Section II is a brief presentation of related work on clustering. Section III details the proposed non-parametric clustering method, whereas Section IV describes how data are processed and organized to be ingested by the deep neural network. Section V analyzes the results of the performed experiments, and Section V-F concludes the paper.

II. RELATED WORK
More recent approaches than the ones already described in Section I have integrated Convolutional Neural Networks (CNN) into clustering. For example, the Joint Unsupervised Learning of Deep Representations and Image Clusters (JULE) [24] method applies agglomerative image clustering while learning image representations at the same time, and yields excellent results on most datasets. JULE also achieves good results in cross-dataset clustering. Additionally, in [25], k-means clustering is combined with classification in an alternating approach that produces soft labels. This approach highlights the strong relationship between data clustering and classification and how convolutional networks have been able to play a leading role in both fields. Dizaji et al. proposed in [26] a lightweight clustering algorithm that projects data into a subspace and then utilizes a stacked multi-layered convolutional autoencoder with a softmax on top to predict clusters. All the aforementioned methods require a known number of clusters to operate. For instance, in [26] the number of clusters is given a priori, whereas in [24] and [25] the number of clusters is estimated by applying an algorithm such as DBSCAN [14] or t-SNE [27] before proceeding to the actual method. The proposed method demonstrates the ability to achieve similar or superior results without knowing the number of desired clusters in advance.
A different approach is introduced in Deep Embedding Clustering (DEC) [28], where the authors first project data into a space of smaller dimension by employing a non-linear mapping function. DEC simultaneously learns the cluster centers by minimizing the Kullback-Leibler (KL) divergence between the distribution of the items and an auxiliary target distribution. In [29], the authors propose a feature extraction method using deep convolutional neural networks trained on faces distinct from other identities, as well as a new cluster-merging algorithm that measures similarity based on local density. Another approach based on convolutional networks is the one in [30], where the clustering problem is approached by training a deep autoencoder to initially extract features. Then, a density-based algorithm is applied in order to calculate the total number of clusters. Both methods achieve good clustering results without prior knowledge of the number of clusters, though on a restricted number of datasets. The proposed method is extensively tested against multiple and diverse datasets, yielding comparable results with parametric methods, or even better results when compared with non-parametric ones.

FIGURE 1. Overview of the proposed pipeline. Subsets of data (red) are each accompanied by a binary label y_1, y_2, . . . , y_{N_s} (green), indicating whether the subset can form a single cluster or not. The subsets are first forwarded through a deep neural network, the Evaluation Network (EN), training a clustering criterion. This criterion is employed by the Clustering Process (CP) to create clusters C_1, C_2, . . . , C_{N_C} and a list of unclustered items B, which will be fed to the network again.

III. PROPOSED METHOD
In this Section the proposed clustering method and the overall framework are described in detail.

A. OVERVIEW
Let D be a set of items d_i, i = 1, 2, . . . , N_D, divided into N_K mutually exclusive partitions K_l, l = 1, 2, . . . , N_K, namely $\bigcup_{l=1}^{N_K} K_l = D$ and $K_l \cap K_m = \emptyset$ for all $1 \le l, m \le N_K$ with $l \neq m$. Let also S = {s_1, s_2, . . . , s_{N_s}} be a random subset of D (S ⊂ D). The elements of S can either belong to the same cluster or not.
The proposed method consists of two components, the Evaluation Network (EN) and the Clustering Process (CP). EN is a binary classification network, trained to distinguish whether an input sample-set contains items coming from the same ground truth class and can therefore be recognized as a cluster, or not. CP is an iterative procedure that uses EN as a clustering criterion in order to decide whether the random subset S of D forms a single cluster or not. In the latter case, CP proceeds by grouping all similar items of S into a single cluster C_1. This procedure is repeated on the remaining items of S in order to form the next cluster C_2, and so forth. The proposed pipeline is illustrated in Figure 1 and will be further analyzed in the subsequent subsections.

B. EVALUATION NETWORK
To train EN, the data need to be arranged through the following procedure. An input sample-set S_y is essentially a set assembled as the union of two distinct subsets. These subsets can contain items from one or multiple clusters, and the sample-set is thus labeled as y = 1 (pure) or y = 0 (impure), respectively. Either case can be expressed as:

$$S_1 = T_{K_l,N_1} \cup T_{K_l,N_2}, \qquad S_0 = T_{K_l,N_1} \cup T_{D-K_l,N_2},$$

where $T_{K_l,N_1}$ is a set of N_1 elements from cluster K_l, $T_{K_l,N_2}$ is a set of N_2 elements from cluster K_l, and $T_{D-K_l,N_2}$ stands for a set featuring N_2 items from classes different than K_l. In the creation of the training, validation and test datasets, the numbers of S_1 and S_0 samples are equalized to avoid bias towards one or the other case. Three ways of combining data were examined during experimentation: a) set-plus-set, b) one-plus-one and c) set-plus-one. The different cases are defined by the values assigned to N_1 and N_2.
In the set-plus-set case each subset contains N_1, N_2 > 1 items, forming an input sample S_y with cardinality |S_y| > 2. The ratio of main (T_{K_l,N_1}) to foreign (T_{D-K_l,N_2}) class items is adjusted by a weight w ∈ (0, 1), related to the subset lengths as N_1 = wN_s and N_2 = N_s(1 − w). In the one-plus-one mode, each subset contains N_1 = N_2 = 1 item, forming an input sample S_y with cardinality |S_y| = 2. This mode is a simplified variation of the set-plus-set mode where a sample-set S_y is a tuple. Finally, the set-plus-one mode is a hybrid extension of the modes already described. Each sample S_y is produced by the union of a subset T_{K_l,N_1} of size N_1 > 1 and a subset T_{D-K_l,N_2} of size N_2 = 1. The intuition behind this mode is that impure samples S_0 containing only a single item from outside the main class K_l are harder for EN to distinguish, therefore allowing it to be trained more effectively. Figure 2 illustrates all the above cases.
In the cases of set-plus-set and set-plus-one, the training, validation and test datasets contain samples S_y of variable length, since the system should be able to perform clustering on sets of arbitrary size.
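For illustration, a minimal sketch of how pure (S_1) and impure (S_0) sample-sets could be assembled from a labeled dataset is given below. The function and variable names are ours and not part of the original implementation; the snippet simply follows the definitions above.

```python
import random

def make_sample_set(items_by_class, pure, n1, n2):
    """Assemble one training sample-set S_y and its binary label y.

    items_by_class: dict mapping a class id to a list of items (e.g. feature vectors)
    pure:           True builds a pure sample S_1, False an impure sample S_0
    n1, n2:         sizes of the two united subsets (n1 = n2 = 1 gives one-plus-one,
                    n1 > 1 with n2 = 1 gives set-plus-one, both > 1 gives set-plus-set)
    """
    main_class = random.choice(list(items_by_class))
    if pure:
        # T_{K_l,N_1} U T_{K_l,N_2}: all items drawn from the same (main) class
        both = random.sample(items_by_class[main_class], n1 + n2)
        subset_a, subset_b = both[:n1], both[n1:]
    else:
        # T_{K_l,N_1} U T_{D-K_l,N_2}: n2 items come from classes other than the main one
        subset_a = random.sample(items_by_class[main_class], n1)
        others = [x for c, xs in items_by_class.items() if c != main_class for x in xs]
        subset_b = random.sample(others, n2)
    return subset_a + subset_b, int(pure)

# Balanced one-plus-one samples, the mode finally adopted in Section IV-B:
# sample, label = make_sample_set(items_by_class, pure=random.random() < 0.5, n1=1, n2=1)
```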

C. CLUSTERING PROCESS
The essence of the proposed clustering method is that, given a sample-set S_y of N_s items, each item s_i, 1 ≤ i ≤ N_s, is sequentially examined against the members of already formed clusters C_k, 1 ≤ k ≤ N_c, where N_c denotes the total number of produced clusters at the current clustering stage.
In Algorithm 1 the Clustering Process is described. More specifically, the first item of the set forms the initial cluster C_1. The next item s_2 is selected and appended to cluster C_1, forming the temporary subset U_1, which is examined by EN as described in III-B. If the output of EN assesses that the members of the temporary subset U_1 belong to the same cluster, CP can move on by confirming that s_2 belongs to C_1. Subsequently, the next item s_3 forms two temporary subsets, U_1 = {s_1, s_3} and U_2 = {s_2, s_3}, one with every item of C_1. CP decides whether s_3 belongs to C_1 by comparing the ratio of the number of positive EN decisions to |C_1| with a voting threshold v_c, as follows:

$$\frac{1}{|C_k|} \sum_{j=1}^{|C_k|} EN(U_j) > v_c . \qquad (1)$$

The described procedure is repeated for every combination between each s_i and each item already appended to C_k, resulting in an EN decision for each U_j = {s_i, c_{k,j}}, 1 ≤ j ≤ N_{C_k}, where c_{k,j} denotes the j-th item in cluster C_k and N_{C_k} denotes the number of items already added. After the first run, all items of S_y have either been assigned to C_1 or left in a remainder list B. When all items of S_y have been visited, Algorithm 1 starts over by selecting the first item from the remainder list B, thus initiating C_2. The total number of Algorithm 1 executions equals the total number of clusters produced. The flowchart depicted in Figure 3 complements Algorithm 1 and schematically summarizes CP. Figure 4 illustrates, step by step, a clustering example.

Algorithm 1 Clustering Process (core loop for the current item s_i)
1: for k < N_c do                      ▷ for all clusters
2:   counter ← 0
3:   for j < N_{C_k} do                ▷ for all items in cluster
4:     if EN(U_j) = 1 then
5:       counter ← counter + 1
6:   if counter > v_c · N_{C_k} then
7:     C_k.append(s_i)
8:   else
9:     B.append(s_i)
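For concreteness, a minimal Python sketch of the Clustering Process follows. It reconstructs the loop of Algorithm 1 under the assumption that `evaluate` wraps the trained EN and returns 1 when the items it receives are judged to form a single cluster; the function and variable names are ours, not the original implementation.

```python
def clustering_process(items, evaluate, v_c):
    """One full run of the Clustering Process over a list of items.

    evaluate(pair) -> 1 if EN judges the pair to form a single cluster, else 0
    v_c            -> clustering voting threshold in (0, 1]
    """
    clusters = []
    remainder = list(items)
    while remainder:                         # each outer pass creates one new cluster
        cluster = [remainder.pop(0)]         # first unclustered item seeds the cluster
        leftovers = []                       # remainder list B of Algorithm 1
        for s in remainder:
            votes = sum(evaluate([member, s]) for member in cluster)
            if votes > v_c * len(cluster):   # enough positive EN decisions, cf. (1)
                cluster.append(s)
            else:
                leftovers.append(s)
        clusters.append(cluster)
        remainder = leftovers                # Algorithm 1 starts over on B
    return clusters
```

The number of iterations of the outer loop equals the number of produced clusters, mirroring the statement above that the total number of Algorithm 1 executions equals the total number of clusters.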

D. MERGING
It has been noted that CP tends to produce an excessive number of clusters compared to the ground truth clusters' number (overclustering). To mitigate this behavior, a merging mechanism is introduced in order to combine smaller clusters into larger ones.
The number of clusters at the end of Algorithm 1 is closely linked to the selection of the cluster voting threshold v_c. Since the presented clustering method is non-parametric with respect to the number of clusters, a strict voting threshold v_c usually leads to more precise clusters, but makes the method prone to overclustering. On the other hand, a loose voting threshold v_c provides fewer and larger clusters but bears the risk of falsely accepting irrelevant samples into clusters. To remedy the situation, the proposed method initially adopts a strict voting threshold v_c in order to obtain a more precise initial clustering, and then utilizes the merging mechanism presented in Algorithm 2.

FIGURE 4. Example of the Clustering Process. The first item of the set, s_1, is picked, and all following items are sequentially examined by CP. If an item is recognized as a member of the same cluster (step 3), it is appended to the cluster, and the next CP decision (step 4) will be made after taking into account each EN(U_j) result for each formed subset U_j. The CP decision is shown on the right of each comparison. The green arrows represent the formation of each U_j subset. Different colors represent different K_l.
At the beginning, for each of the initial clusters, the algorithm calculates a mean vector µ_k from all feature vectors c_{k,j} of the items contained in cluster C_k. Then, for each cluster, the Euclidean distance d(c_{k,j}, µ_k) between each feature vector c_{k,j} and the mean µ_k is computed, as shown in (3):

$$d(c_{k,j}, \mu_k) = \lVert c_{k,j} - \mu_k \rVert_2 . \qquad (3)$$

Note that µ_k is a computed feature vector, not necessarily associated with any item of C_k.
After comparing all distances d(c_{k,j}, µ_k), the vector c_{k,j} with the smallest distance to µ_k is selected as the representative item of cluster k. Subsequently, the inter-cluster distances between the representative items of all produced clusters are calculated and an N_c × N_c distance matrix Z is generated. Based on the distance matrix Z, the method creates tuples of similar clusters, with the smallest cluster considered the candidate and the larger one the anchor. This convention is adopted so that the smaller cluster is always appended to the larger one and not vice versa. It has to be noted that, due to the nature of EN, a large and precise cluster (having items from only one ground truth cluster) is expected to be more robust than a smaller, equally precise cluster. Clusters are then merged by comparing the representative of the candidate cluster with all items of the anchor cluster through a voting mechanism similar to the one described in Algorithm 1. More precisely, if the total number of positive votes surpasses v_m · |C_k|, i.e., the fraction v_m of the anchor cluster cardinality, the candidate cluster is appended to the anchor. During experimentation, various v_m values were tested, and it was observed that the best results are obtained when a strict threshold is applied during clustering and a more relaxed one during merging. Figure 5 illustrates how each candidate cluster is matched with an anchor cluster, based on the feature distance between their representative items.

Algorithm 2 Merging (excerpt)
1: for k < N_c do
2:   for j < N_{C_k} do                ▷ compute intra-cluster distances
3:     dist ← d(c_{k,j}, µ_k)
4:   r.append(argmin(dist))            ▷ representatives
5: for i, j < size(r) do               ▷ get all representative distances
6:   Z.append(d(r_i, r_j))
7: for k < N_c do                      ▷ compute inter-cluster distances
8:   candidate ← argmin(Z)
9:   counter ← 0
10:  for j < N_{C_a} do                ▷ for all items in anchor cluster
11:    if EN(U_j) = 1 then
12:      counter ← counter + 1
13:  if counter > v_m · N_{C_a} then
14:    append candidate to anchor
Nonetheless, the merging procedure is exhaustive in nature and can render the overall process cumbersome when dealing with large datasets. To avoid unnecessary calculations, the decision on whether a new item should be accepted into an existing cluster or not is taken as soon as the total number of votes for either case surpasses the required number of votes, defined by N_{C_k} · v_m. Algorithm 2 describes the merging procedure in a more systematic manner, accompanied by the flowchart in Figure 6.

FIGURE 6. Flowchart of the merging mechanism as described in Algorithm 2. An iterative process calculates the matrix of representative distances Z, which is then used to decide which clusters may merge with larger ones. The EN is employed again, in order to double-check whether the candidate cluster should be merged with the respective anchor cluster.
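Under the same assumptions as the previous sketch, the merging mechanism of Algorithm 2 can be outlined as follows. The representative selection and the voting rule follow the description above, while details such as computing the representatives only once and visiting clusters from smallest to largest are our simplifications.

```python
import numpy as np

def merge_pass(clusters, evaluate, v_m):
    """Single merging pass over the clusters produced by the Clustering Process.

    clusters: list of lists of feature vectors (1-D numpy arrays of equal length)
    evaluate: EN decision on a pair of items (1 = same cluster, 0 = different)
    v_m:      merging voting threshold in (0, 1]
    """
    # Representative of each cluster: the member closest to the cluster mean.
    reps = []
    for cluster in clusters:
        mu = np.mean(cluster, axis=0)
        reps.append(min(cluster, key=lambda c: np.linalg.norm(c - mu)))

    # Visit clusters from smallest to largest; each acts as a merge candidate.
    merged = [list(c) for c in clusters]
    order = sorted(range(len(clusters)), key=lambda k: len(clusters[k]))
    for k in order:
        if not merged[k]:
            continue
        # Anchor: the nearest cluster that is at least as large as the candidate.
        anchors = [j for j in range(len(clusters))
                   if j != k and len(merged[j]) >= len(merged[k])]
        if not anchors:
            continue
        a = min(anchors, key=lambda j: np.linalg.norm(reps[k] - reps[j]))
        # Vote: compare the candidate representative with every anchor item.
        votes = sum(evaluate([item, reps[k]]) for item in merged[a])
        if votes > v_m * len(merged[a]):
            merged[a].extend(merged[k])
            merged[k] = []
    return [c for c in merged if c]
```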

IV. DATA PROCESSING
In this Section, the organization of the input data so that it can be processed by both EN and CP is described. Moreover, an extended list of the datasets used is provided, together with the individual data processing conducted for each one.

A. FEATURE EXTRACTION AND INPUT TRANSFORMATION
Instead of using the original data as input to the network, the proposed method utilizes the feature embeddings of a pre-trained ResNet-50 classification network, and more precisely the implementation of torchvision [31], with the fully connected and softmax layers excluded in order to extract only the 2048 × 1 feature vectors.
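As an illustration, extracting such embeddings with torchvision could look like the sketch below; the exact image pre-processing used in the original implementation is not specified and is therefore omitted here.

```python
import torch
import torchvision

# ResNet-50 from torchvision with the final fully connected (and softmax) layer removed,
# exposing the 2048-dimensional output of the global average pooling.
backbone = torchvision.models.resnet50(pretrained=True)
extractor = torch.nn.Sequential(*list(backbone.children())[:-1]).eval()

@torch.no_grad()
def extract_features(batch):
    """batch: float tensor of shape (B, 3, H, W) -> features of shape (B, 2048)."""
    return extractor(batch).flatten(1)
```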
The system's ability to process samples of variable length is desirable but introduces the difficulty of an inconsistent input size. To alleviate this, instead of the original input data, the network is fed with the data outer product matrix P. Let s_i be an item that belongs to an input sample-set S_y, and let S be a matrix that includes the feature vector representation of each s_i, namely S = [s_1, s_2, . . . , s_{N_s}]. Given that the total number of items is N_s and the dimension of each feature vector is t, the dimension of S is N_s × t and, consequently, the dimension of S^T is t × N_s. The t × t outer product of S is therefore acquired by multiplying S^T by S:

$$P = S^{\top} S . \qquad (4)$$

To normalize the resulting values, the mean value of the elements of P, denoted as p_µ, is first subtracted and the result is then divided by N_s:

$$\hat{P} = \frac{P - p_\mu J_t}{N_s} , \qquad (5)$$
where J_t denotes a t × t all-ones matrix. The described data transformation turns samples of variable length into an input of fixed dimensions, which is essential for feeding each sample-set S_y to the EN. This shape transformation is crucial when input samples are created by the set-plus-set or the set-plus-one modes, i.e., when N_s > 2. Alternatively, if a sample is a tuple constructed by the one-plus-one mode, the combination of the two items can either be a special case of (4), with N_s = 2, or a simple tensor concatenation.
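A direct transcription of (4) and (5) into Python is given below; only the helper name is ours, and the all-ones matrix J_t is applied implicitly by subtracting the scalar mean from every element.

```python
import numpy as np

def sample_to_input(feature_vectors):
    """Map a sample-set of N_s feature vectors of dimension t to the
    normalized t x t matrix fed to EN, following (4) and (5)."""
    S = np.stack(feature_vectors)                  # shape (N_s, t)
    P = S.T @ S                                    # (4): t x t, independent of N_s
    return (P - P.mean()) / len(feature_vectors)   # (5): subtract p_mu, divide by N_s
```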

B. LIST OF DATABASES
• Columbia Object Image Library (COIL) [32]: There are two versions of the COIL database, COIL20 and COIL100. COIL20 is a collection of 1440 grayscale images of size 128 × 128, acquired from 20 objects, whereas COIL100 is a set of objects with a wider variety of complex geometric and reflectance characteristics compared to COIL20. The latter consists of 7200 color (RGB) images of 100 objects captured in the same manner as COIL20.
• MNIST [33]: The MNIST database is one of the most well-known and easily recognizable image databases. MNIST features 70000 handwritten digits of 10 classes, split into 60000 train and 10000 test samples.
• YouTube Faces Database (YTF) [36]: A database of frames captured from face videos, designed for unconstrained face recognition. The YTF database contains 3425 videos of 1595 different people. The proposed approach is evaluated on the first 40 subjects of the dataset, as in [12], [24], [29], [37], which roughly contain 10000 images.
Regarding the way data are fed to EN, the one-plus-one mode was finally selected after experimenting with the different variants of data processing and different mixture weights w as defined in Section III-B. The set-plus-set mode proved to be a resource-intensive option with questionable results, due to the vast number of different data combinations. Furthermore, a dataset featuring such diverse samples can drastically increase the dimensionality of the problem, in which case the network fails to generalize and to produce task-specific rules. Finally, the set-plus-one mode would need a far deeper network to capture the subtle differences between cluster and non-cluster samples.

C. DATA SPLITTING AND AUGMENTATION
The first step of data handling is to obtain the transformed samples that will augment the existing dataset in case overfitting is observed during training. The original data are transformed by applying 32 × 32 pixel random crops with a padding of 4 pixels and random horizontal and vertical flips. After obtaining the feature embeddings of both the original and the augmented data as described in IV-A, both are handled identically but independently to avoid any conflict. For the COIL20, COIL100 and YouTube Faces datasets, which do not provide train and test sets, the original and augmented features are split by keeping 70% of the data for training and 30% for testing purposes. The same ratio is applied again on the training data in order to monitor the training progress on a constant validation set. When training on a dataset is evaluated using cross-validation, five folds are used; one fold is used for validation while training on the other four.
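Expressed with standard torchvision transforms, the augmentation described above could be sketched as follows; the composition order and the final tensor conversion are assumptions, not taken from the original implementation.

```python
from torchvision import transforms

# 32 x 32 random crops with 4-pixel padding plus random horizontal/vertical flips,
# used to augment the training data when overfitting is observed.
augment = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.ToTensor(),
])
```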

V. EXPERIMENTAL RESULTS
The proposed method is built on a binary classifier that needs to process high-dimensional data coming from multiple, diverse sources with minimal resource overhead. Three ResNet [38] architectures were tested, featuring 18, 34 or 50 network layers, respectively. Each residual block in the 50-layer version comprises three layers, whereas the residual blocks of the smaller ResNet18 and ResNet34 networks are two layers deep.

A. PERFORMANCE AND SPEED TRADE-OFF
In order to inspect all three architectures in terms of performance and execution time, the proposed method is applied on the COIL20 database, which is the smallest and simplest one in terms of number of items, image size and color information. All three experiments have been performed on a machine equipped with an nVidia GeForce GTX 1080 GPU, 128GB of RAM and an Intel Xeon E5-2620 v4 @ 2.10GHz CPU. In Table 1, the performance and clustering duration of each architecture are shown. Performance is measured as F-score and clustering duration in minutes. The fastest implementation is achieved when EN utilizes a ResNet18 architecture; however, its performance is inferior to the 34-layer version, which achieves a perfect clustering. Finally, the 50-layer network fails to yield results on par with the smaller architectures, as it fails to generalize: during training, EN processes each sample-set independently, in an abundance of combinations, which leads to overfitting. Overfitting can also occur when the training sample-sets are so distinct that the model struggles to fit them all, a phenomenon referred to as the curse of dimensionality in the machine learning context [39]. Apart from the performance results, the most complex version of the network also needs more time to perform the clustering.

TABLE 1. Performance and duration of the three examined ResNet architectures during clustering on the COIL20 database. The most complex architecture needs 2.5× more time in order to achieve equal performance to the simplest one, and is therefore rejected.
The first convolution layer of all ResNet implementations is applied with a large stride in order to scale down the number of parameters in the following layers. During training, the learning rate is scheduled to decrease, starting from a value of lr = 10^-3, and the sigmoid layer is omitted. Instead, the sigmoid function is computed jointly with the network loss through the binary cross entropy of the network output and the target label, as this approach is reported to be more numerically stable [40]. The sigmoid layer is activated again during inference.
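In PyTorch terms, omitting the sigmoid and folding it into the loss corresponds to training on a single output logit with BCEWithLogitsLoss. The sketch below illustrates this setup; the single-channel first convolution, the choice of optimizer and the step decay schedule are our assumptions about details the text leaves open.

```python
import torch
import torchvision

# Evaluation Network: a ResNet-34 with a single output logit (no sigmoid layer).
en = torchvision.models.resnet34(num_classes=1)
# Assumed: the t x t input matrix is treated as a one-channel image.
en.conv1 = torch.nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)

criterion = torch.nn.BCEWithLogitsLoss()   # sigmoid + binary cross entropy, numerically stable
optimizer = torch.optim.Adam(en.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

def train_step(batch, labels):
    """batch: (B, 1, t, t) float tensor, labels: (B,) tensor of 0/1 targets."""
    optimizer.zero_grad()
    loss = criterion(en(batch).squeeze(1), labels.float())
    loss.backward()
    optimizer.step()
    return loss.item()

# During inference the sigmoid is applied again: prob = torch.sigmoid(en(batch))
```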

B. CLUSTERING AND MERGING THRESHOLDS
As described in III-C and III-D, CP depends on the clustering and merging thresholds v_c and v_m, which adjust how rigorous the method is towards accepting new items into an already formed cluster, or merging smaller clusters with larger ones. These threshold values essentially indicate the number of votes required before accepting or rejecting a clustering or merging candidate. Large v_c and v_m values represent strict thresholds, as more votes are required to make a decision. For instance, v_c = 1 and v_m = 1/2 denote that all items of a formed cluster must vote positively for the insertion of a new candidate into the cluster, whereas only half of the votes are required for merging. Tables 4, 5 and 6 illustrate how different v_c and v_m value combinations affect the performance of the method on the COIL20, COIL100 and MNIST-test databases. For this ablation study, only 50% of the overall dataset has been used, which justifies the slight variation in performance scores when compared to the overall final results in V-D. As can be seen in all tables, a very strict threshold combination produces an excessive number of small clusters, many of which comprise only one item (singleton clusters). This behavior is expected for the following two reasons: first, a large v_c value prevents new items from being appended to existing clusters easily, which can heavily affect the precision score; second, even if some small clusters are formed, a large v_m value restrains the merging mechanism from combining them into larger ones. In the opposite case, small threshold values result in fewer clusters comprising more items, which may, however, falsely include items of multiple classes, thus negatively affecting the recall score.

C. METRICS
The proposed clustering method is evaluated by adopting three measures: the normalized mutual information (NMI) [46], the accuracy (ACC) as proposed in [37], and an F-score, as proposed in [47]. Although NMI has been broadly used to compare different clustering approaches, it is a measure that requires the number of clusters produced by the method to be equal to that of the ground truth. Since the presented method is based on the absence of this parameterization, the total number of clusters, and consequently the number of items contained in a cluster, is unknown. Thus, the NMI measure fails to precisely judge clusters with more or fewer items than the respective ground truth cluster. To calculate the NMI score of the proposed clustering method, an approach similar to the implementation of [37] is adopted, after modifying the cardinality of a produced cluster |C_k| to match the ground truth cluster's cardinality |K_l|, or vice versa, by discarding items from either C_k or K_l. The decision on which cluster items will be discarded is based on the distance matrix between each item and the cluster's representative, generated for every cluster as described in Section III-D. In this way, the NMI is calculated for a number of clusters that has not been predefined, from the mutual information MI(·,·) and the entropies H(·) of a cluster C_k and a ground truth class K_l after modifying their cardinality to q_{k,l} = min(|C_k|, |K_l|). Despite the fact that NMI is considered a standard clustering metric, some related work uses the adjusted mutual information (AMI) score instead, because of the known drawback of NMI to favor fine-grained partitions [48]. AMI adjusts the mutual information for the agreement expected by chance.

Another popular clustering measure is the cluster accuracy (ACC). The ACC of a clustering with respect to the ground truth counts the agreements under the best matching between produced clusters and ground truth classes, where 1(C_k = K_l) = 1 if C_k = K_l and 0 otherwise. Since there is no information provided on how the ground truth clusters are matched to the produced clusters, the best permutation is found by first constructing an N_c × N_c cost matrix and then solving the linear sum assignment problem.

The fact that the ACC and NMI metrics are inadequate to provide accurate results in this setting is the intuition behind resorting to a metric that simultaneously measures performance both quantitatively and qualitatively. For each produced cluster C_k, its precision with respect to a ground truth cluster K_l is computed as prec(C_k, K_l) = |C_k ∩ K_l| / |C_k|, and the recall of C_k with respect to K_l is defined as rec(C_k, K_l) = |C_k ∩ K_l| / |K_l|. A high precision score implies that most items in the cluster are correctly grouped together in that cluster, but the algorithm may have missed samples that should have been included too. On the other hand, a high recall score means that the method has accurately clustered most images of a specific class but may have also included samples that belong to other classes. Given the precision and recall of a cluster pair, their F-score is defined as:

$$F(C_k, K_l) = \frac{2 \cdot prec(C_k, K_l) \cdot rec(C_k, K_l)}{prec(C_k, K_l) + rec(C_k, K_l)} .$$

It is safe to assume that, for each C_k, K_l pair, the highest F-score value is located where the mutual information is maximized. To conclude to a single score, an overall F-score is computed by aggregating the best-matching pairwise F-scores, where N_K is the number of ground truth clusters and N_C is the number of clusters that the method extracted.

D. COMPARISON WITH OTHER METHODS
Tables 2 and 3 show how the proposed work compares to other clustering methods. For each experiment, the clustering and merging threshold values, v_c and v_m, were selected based on the ablation study presented in V-B. Since there is an abundance of databases on which clustering methods are evaluated, and the employed databases do not always match, the proposed method is evaluated on as many common databases as possible. More specifically, Table 2 displays the performance of the proposed method against other non-parametric methods. The significant increase in total process time across experiments indicates how the proposed method copes with larger datasets; CP needs to scan all items of the given dataset at least once (O(n)) and then repeat the process on the set of unclustered items as many times as needed, with the worst-case scenario being that all items remain unclustered, as the method yields only singleton clusters (O(n!)). However, the exhaustive nature of the proposed method leads to notable results, as it exploits the advances in classification methods and the increasing improvements in available hardware. As future work, new strategies will be investigated towards avoiding the worst-case scenario, by either random cluster initialisation or clustering preprocessing for initial clusters. Table 3 illustrates the performance of our work against methods that also exploit the knowledge of the number of clusters (parametric methods).
Despite the significant advantage of the parametric clustering methods, the method presented in this work is capable of being on par with their performance and, moreover, even surpasses them on certain datasets. In order to be able to compare with all methods, the efficiency of this approach is also reported using F-score, ACC and, where noted with *, AMI. For the parametric methods presented in Table 3, the selection of NMI, ACC or AMI is expected, since the number of clusters is predefined. However, for non-parametric methods, in order to employ these metrics, the conventions described in V-C have to be applied. These conventions question the reliability of the NMI, AMI and ACC measures when the expected number of clusters is unknown; thus, this work considers the F-score a fairer clustering metric.

E. CROSS DATASET CLUSTERING
In addition to the favorable results presented in Section V-D, the proposed method is further tested on cases where the Evaluation Network is trained on one database but the learned weights are used by the Clustering Process on another dataset. The results in Table 7 illustrate that the proposed method is capable of providing adequate results even when CP is evaluated on data very different from the ones EN has been trained on. The MNIST and FashionMNIST databases are identical in terms of size and structure; they, however, feature very different content. The fact that the FashionMNIST database is a collection of clothing images while the MNIST-test database contains handwritten digits demonstrates EN's ability to extract patterns from dissimilar modalities. Samples of both databases are illustrated in Figure 7. For both tests, the accuracy is calculated with the F-score and the model weights are produced by training EN with the lightweight 18-layer ResNet architecture. Cross-dataset clustering tests are a real-world challenge, as data are not always available in large volumes, as in public datasets.

TABLE 3. Performance of the proposed method against state-of-the-art methods where the number of clusters is known a priori. Empty cells denote that no results were reported by the respective methods on the specific database.

TABLE 6. Performance (F-score) and number of extracted clusters of the method with various clustering and merging threshold combinations, using the ResNet18 architecture on the MNIST-test database. Very strict threshold values v_c and v_m, depicted in the first column, force CP to produce an excessive amount of singleton or two-item clusters, and consequently, a very low F-score.

F. CONCLUSION AND FUTURE WORK
This work presented a novel clustering methodology, based on convolutional neural networks, that does not require the number of clusters to be provided beforehand. The approach demonstrated in this work outperforms most approaches that do not utilize this knowledge. In addition, it is on par with, or even outperforms, the state of the art among most parametric clustering approaches. Finally, the proposed method demonstrates the ability to cluster images from modalities that it has never encountered before.
Regarding future work, the proposed method can be extended to data of additional modalities, such as sound or text, by employing related networks and fine-tuning the methodology accordingly. To ameliorate the extended runtime complexity, alternative algorithmic strategies can be explored for CP, as already examined in the literature, by combining the proposed method with well-known algorithms such as DBSCAN or t-SNE. Fusing effective clustering algorithms with the proposed pipeline does not modify the core methodology and objective of this work, which is, as described in III-B, the development of a strong, reliable clustering criterion. Finally, it has to be noted that the method's runtime speed is closely linked to the architecture of the employed network, as already shown in V-A. Although utilizing a simpler and more efficient network than the ResNet architectures will not reduce the algorithm's overall complexity, it can definitely help achieve faster inference.