Dealing With Multipositive Unlabeled Learning Combining Metric Learning and Deep Clustering

Standard supervised classification methods assume that the training data is fully annotated, thus requiring an a priori labelling process that is both costly and time-consuming. To relax this requirement, many different flavors of weakly supervised learning have been proposed. Among weakly supervised learning strategies, Positive Unlabelled learning (PUL) is gaining attention from the research community due to the wide spectrum of applications it can fit. However, the majority of research studies related to PUL only consider binary classification tasks, while real-world applications commonly involve multiple categories. To deal with this limitation, Multi-Positive Unlabelled learning (MPUL) has been recently introduced to learn from examples labelled with multiple positive labels and a single unknown negative label. To date, only a limited number of research works have been proposed to cope with this more general setting. In this paper, we propose a new MPUL framework based on deep learning strategies. Our framework, named ProtoMPUL (Prototype based Multi-Positive and Unlabelled Learning), combines metric learning and clustering strategies to model the set of positive classes as well as to characterize the unknown negative one. Experimental evaluations on real-world benchmarks considering recent MPUL competitors demonstrate that the proposed framework achieves state-of-the-art performance, thus supporting the validity of the proposed approach.


I. INTRODUCTION
Standard supervised classification methods assume that the training data is fully annotated with the whole set of classes of interest. This requires an a priori labelling process that is costly in both money and time, and that can be unaffordable or unrealistic in several real-world scenarios. To relax this strict requirement, several weakly supervised learning settings [2] have been proposed. Among them, it is worth mentioning active learning [3], semi-supervised learning [4], multi-instance learning [5], learning with label noise [6] and positive unlabelled learning (PUL) [7].
In recent years, Positive-Unlabelled learning has received growing attention from the research community. The objective of PUL is to learn a classifier from an incomplete training set where only a portion of the positive samples have associated label information, while no label is available for samples belonging to the negative class [8]. In other terms, under such a learning setting, the training set is composed of two parts: a labelled one containing only positive samples and an unlabelled one containing both positive and negative samples. This learning setting is of particular importance in many real-world applications [8].
The associate editor coordinating the review of this manuscript and approving it for publication was Gustavo Callico.
As a practical example, let us consider an automatic diagnosis scenario where a system aims to predict whether a patient has a particular disease. In this scenario, patients diagnosed with the disease are labelled as positives, while patients that were not diagnosed with the disease are unlabelled, since not being diagnosed is different from not having the disease [9]. The same problem occurs when a company wants to create an archive of researchers' home pages using web-crawling techniques. Once downloaded, a web page should be classified to decide whether it is a researcher's home page or some other page. In such a context, the concept of the positive sample is well defined (the researcher's home page) while the negative concept is not well established [7], because no real characterization of what is not a home page is supplied.
In a medical context, we might want to recognize vascular lesions starting from medical images [10]. In this particular case, accurately labelling vascular lesions could take more than one year, so they are often left unlabelled, while it is relatively easy to assign positive labels to healthy individuals.
In all these scenarios, a method that exploits both positive and unlabelled samples could save time, money and human labor: the expert may focus his/her effort on defining only what is good, avoiding the thankless task of recognizing what is not.
In the field of remote sensing, the problem is even more relevant. To deal with tasks like land cover or natural resources mapping from satellite images, samples can be supplied for a particular category of interest (e.g., urban, forest or wheat crop), while it can be very hard to identify negative samples that completely describe the underlying landscape. In this scenario, PUL methods can be effectively used while standard supervised approaches simply cannot [11].
Although existing PUL frameworks are effective and demonstrate promising performance in a plethora of diverse applications, they share the limitation that only binary classification tasks are considered, while many real-world scenarios involve multi-class classification problems. As an example, large-scale e-commerce platforms aim to detect cyber security attacks: the malicious attacks can be considered as positive samples, while benign transactions can be considered as negative ones. As there are normally multiple kinds of cyber security attacks, the positive samples can be categorized into multiple classes. Similar problems occur for personalized email filters. Some systems should allow some spam to pass through in addition to non-spam emails, and these allowed emails are also organized into several positive classes. In remote sensing classification analysis, samples can be supplied for multiple categories of interest without providing exhaustive coverage of the different underlying land cover classes. In this case, positive samples coming from many land cover classes are available, while the negative class is hard to pinpoint to a single label.
To deal with such scenarios, Multi-Positive Unlabelled learning (MPUL) (see Fig. 1) was recently introduced [12]. In this learning setting, the available training data is composed of a labelled set of samples spanning K − 1 positive classes of interest and an unlabelled set that contains samples coming from K classes (the K − 1 positive classes plus an additional negative one). The goal is to learn a multi-class classification model capable of assigning an unseen test sample, at inference time, to one of the K classes in which the classification problem is defined. While many frameworks and solutions have been proposed for the binary setting, only a few works [12], [13] have tackled the more general problem related to the multi-class scenario.
To cope with the under-explored Multi-Positive and Unlabelled Learning (MPUL) scenario, in this paper we propose a new deep learning-based framework especially tailored to propositional (or tabular) data. Our framework, named ProtoMPUL (Prototype based Multi-Positive and Unlabelled Learning approach), combines metric learning and clustering strategies with the goal of modelling the set of positive classes for which label information is available and, simultaneously, supplying a characterization of the unknown negative one. ProtoMPUL involves three different stages: first, an autoencoder is trained to extract an initial data embedding; second, a metric learning strategy is adopted to stretch the manifold in which the data is embedded, considering the available (positive) label information; and third, a deep clustering process is used to further refine the separability among the multiple positive classes and the negative one and, simultaneously, to provide per-class prototypes. At inference time, the encoder network as well as the learnt prototypes are employed to classify previously unseen test samples.
To assess the behavior of our framework, we provide an experimental analysis on real-world benchmarks coming from different domains. The benchmarks for Multi-Positive and Unlabelled learning were generated by following a protocol similar to the one adopted by [12], [13]. Results show the effectiveness of the proposed framework w.r.t. recent competing methods especially tailored to deal with MPUL scenarios.
The rest of the paper is structured as follows: Section II discusses related work; preliminary definitions and technical background on Multi-Positive Unlabelled Learning are introduced in Section III; the ProtoMPUL framework is presented in Section IV; Section V describes and discusses the experimental evaluation; Section VI concludes and outlines possible future work.

II. RELATED WORK
PU learning was introduced in [8], where the problem was defined and motivated as important for many practical applications in which positive samples are cheap to acquire. For instance, it might be easy to acquire files of patients who have a particular disease, but it might be difficult to get the files of patients for whom we have a negative diagnosis. At the same time, one could easily and cheaply get unlabelled data by using data for patients without a diagnosis. Due to its practical usefulness, PUL has recently been used in a wide variety of application scenarios [14] and has been further developed in many theoretical papers [8].
Methods to deal with the PUL setting can be roughly divided into three families [15]. The first family [7], [16] employs a two-step method where reliable negative samples are first selected and then used to train a traditional binary classifier. As the second step is a trivial application of binary classification, methods falling in this family mostly differ in the way they determine the reliable negative samples. For instance, in [7] the authors search for features of positive samples that have a distribution that is markedly different in the unlabelled set. Reliable negative samples are then found by removing from the unlabelled set the samples having the positive feature signature. In [17] the authors propose to exploit probabilistic generative models to characterize the distribution of the positive samples, and to label as reliable negative samples those that are in the lowest density regions with respect to the positive ones. In addition, the proposed framework creates mixtures of generative models by adopting a bagging mechanism from the discriminative framework as an effective and cheap alternative to the classical Expectation Maximization strategy.
The second family of approaches formulates the PU learning problem as a cost-sensitive task [18], [19] where the errors on positive and negative samples are weighted differently. Here, differences between approaches can be significant, but mainly concern the schemes used for assigning the weights. For instance, in [19] a fixed weight is used on negative samples, while in [18] weights vary depending on the negative sample. In [20] a novel non-negative risk estimator for the positive and unlabelled learning setting is introduced. The proposed estimator can be used to evaluate the risk for a set of symmetric losses (e.g., the mean-squared-error reduction) as well as to train common binary classifiers for the case of positive and unlabelled learning.
The third family [21] models the unlabelled data as negative samples with label noise, thus PU learning reduces to a binary classification problem with one-sided label noise. [22] proposes an adaptive sampling framework for positive and unlabelled learning and for learning with label noise. The proposed framework iteratively estimates the class mislabeling probability with an adaptive sampling procedure which reduces the risk of selecting mislabeled instances for model training. Subsequently, it is able to construct generalizable models even when a large proportion of mislabeled instances is present in the data.
[12] was the first work to propose an extension of the standard (binary) PU learning to the multi-class scenario, introducing the Multi-positive Unlabelled (MPU) learning setting. The proposed approach is based on a one-step method in which the MPU learning problem is modeled minimizing multiple convex loss functions acting on labelled and unlabelled data.
More recently, [13] presented a new MPU learning approach based on a risk estimator derived from the one proposed in [12]. The authors build on the observation that the risk estimator proposed in [12] was affected by overfitting issues possibly caused by the unboundedness of the estimators. The authors propose a bounded risk estimator that alleviates this problem and avoids possible biases.
In our work, we use deep autoencoders, metric learning and deep clustering to induce an embedding that simplifies separating the negative class from the positive ones. It is similar in spirit, albeit very different in practice, to the approaches in the first of the positive unlabelled learning families introduced above. Also, while most of the research efforts in the PU learning literature have been devoted to coping with a binary setting in which only one positive class is available at training time (e.g., [8]), we concern ourselves with the problem of non-binary PUL settings, i.e., settings where we have multiple positive classes and one negative class, such as those proposed in [11], [12], [23]. To this end, we build on the problem definition proposed in [12], but leverage the combination of three deep learning techniques: the minimization of a metric learning loss, deep autoencoders, and deep clustering. As we show in the ablation study in Section V, each one of the three components we propose provides improved accuracy to the final solution.

III. MULTI-POSITIVE AND UNLABELLED LEARNING
The Positive and Unlabelled learning (PUL) setting [8] considers a scenario in which we have at our disposal a training dataset $D = P \cup U$ composed of a set $P$ of positive samples and a set $U$ of unlabelled samples. The unlabelled set $U$ contains both positive and negative samples but their label information is not accessible. In this scenario, the PUL setting has the objective of exploiting both $P$ and $U$ to learn a binary classification model allowing the assignment of a binary label (positive or negative) to new, previously unseen, samples.
The Multi-Positive and Unlabelled learning (MPUL) setting [12] generalizes the PUL approach to a multi-class scenario in which the positive set P contains samples belonging to K − 1 classes (with the associated label information) while the set U contains samples (but not labels) of all the K classes, i.e., the K − 1 positive classes plus samples belonging to the (unknown) negative class. In this scenario, the MPUL setting has the objective to learn a multi-class classification model from both P and U with the aim to classify previously unseen samples to one of the K classes.
More formally, in the MPUL scenario, we denote with $P = (X_l, Y_l)$ the positive examples. $P$ is thus a pair formed by the set of examples $X_l = \{x_i\}_{i=1}^{N_l}$ and the set of corresponding labels $Y_l = \{y_i\}_{i=1}^{N_l}$. Each example $x_i$ is a vector in $\mathbb{R}^d$, and the corresponding label $y_i$ is an element of the set $\{1, \dots, K-1\}$. In this setting, the additional set $U = X_u = \{x_i\}_{i=1}^{N_u}$ contains no label information. With $X$ we indicate the union of labelled and unlabelled samples, $X = X_l \cup X_u$. In the following, $N$, $N_l$, and $N_u$ represent the number of examples in the sets $X$, $X_l$, and $X_u$, respectively. We emphasize that a sample $x_n \in X_u$ can belong to any one of the $K$ classes ($K-1$ positive classes plus the negative one), but the label information is unknown at learning time.

IV. PROTOTYPE BASED MPUL
In this section, we introduce ProtoMPUL: a deep neural architecture (see Fig. 2) and the corresponding training algorithm to deal with the Multi-Positive Unlabelled learning setting. The current architecture is tailored for propositional (tabular) data (e.g., all involved neural networks are fully connected). ProtoMPUL exploits metric learning and deep clustering based strategies in order to characterize the K − 1 positive classes as well as the unknown negative class. The result of the learning process consists of a prototype representation for each of the K classes. At the end of the process, classification can be performed by projecting the samples into the learnt embedded space and assigning each of them to the class corresponding to its closest prototype.
The algorithm to train ProtoMPUL goes through three different stages: Stage 1) an autoencoder is pre-trained on the full set of available data via a layer-wise incremental procedure [24]; Stage 2) the autoencoder is complemented by a metric learning loss allowing the system to integrate the information provided by the labelled samples; Stage 3) the class separability is reinforced by adopting a deep learning clustering approach [25] which, ultimately, provides the per-class prototype representation.
The three stages have the objective to modify progressively the manifold in which the original data is projected enforcing a cluster structure to separate all the classes involved in the classification problem.
The training algorithm for ProtoMPUL (see Algorithm 1) takes as inputs the set of positive and unlabelled samples with the associated label information $(X_l, Y_l, X_u)$ and the total number of classes $K$. The algorithm is also parametric with respect to the stopping condition in the loops that characterize each stage. While more sophisticated convergence criteria could be devised, in our experiments we simply use a fixed number of epochs for each stage and treat it as a user-defined parameter. The results of the algorithm are the parameters $e$ of the encoder function $\mathrm{enc}_e$ and the set of prototypes $p$.

Algorithm 1
Stage 1
    use greedy layer-wise pretraining to initialize $e$, $d$ by descending the gradient $\nabla_{e,d} L_r(X)$;
Stage 2
    while stopping condition 2 not met do
        update $e$, $d$ by descending the gradient $\nabla_{e,d} L_r(X)$;
        update $e$ by descending the gradient $\nabla_{e} \gamma_m L_m(X_l, Y_l)$;
    end
Stage 3
    $p$ = K-means($\mathrm{enc}_e(X)$, $K$);
    while stopping condition 3 not met do
        every $T$ iterations do
            update the auxiliary target distribution $B$ using (4);
            break if percentage of changed labels $\le \delta$;
        end
        update $e$, $d$, $p$ by descending the gradient $\nabla_{e,d,p} \left( L_r(X) + \gamma_c L_c(X) \right)$;
        update $e$ by descending the gradient $\nabla_{e} \gamma_m L_m(X_l, Y_l)$;
    end

Stage 1 performs a greedy layer-wise pre-training of the autoencoder. The autoencoder will provide the embeddings of the samples that are central to our approach. The goal of this stage is to initialize the autoencoder to a sensible starting point by training it to autoencode the complete set of data ($X = X_l \cup X_u$). In later stages the autoencoder will be refined using information from the label set $Y_l$ and from the clustering loss. In all our experiments the autoencoder is based on fully connected layers and has shape 500 − 500 − 2000 − 10 − 2000 − 500 − 500. The central (bottleneck) layer provides a new representation for the samples, which are therefore embedded in some subspace of $\mathbb{R}^{10}$. The layers in the autoencoder are based on the ReLU [26] activation function. As previously mentioned, the autoencoder model is trained layer-wise following the procedure proposed in [24]. The layer-wise greedy strategy incrementally trains the encoder and the decoder networks adding one layer at a time, facilitating the optimization of the parameters of the deep autoencoder. In addition, for this pretraining stage, we adopt a denoising strategy [27] to learn the model parameters, in which the autoencoder network has to reconstruct a particular sample from its corrupted version.
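As a rough illustration of the architecture and of the denoising input corruption described above, the following Python sketch lists the layer shapes for a given input dimension and corrupts a sample by masking. The helper names (`layer_shapes`, `mask_corrupt`), the final reconstruction layer of width `input_dim`, the masking noise and the `rate` parameter are assumptions made for this example; the text only states that a denoising strategy [27] is adopted, not which corruption is used.

```python
import random

# Hidden layer widths reported in the text; the 10-unit layer is the bottleneck.
HIDDEN_WIDTHS = [500, 500, 2000, 10, 2000, 500, 500]


def layer_shapes(input_dim):
    """Return (fan_in, fan_out) for each fully connected layer of the autoencoder.

    Assumes a final layer mapping back to `input_dim` for the reconstruction.
    """
    widths = [input_dim] + HIDDEN_WIDTHS + [input_dim]
    return list(zip(widths[:-1], widths[1:]))


def mask_corrupt(x, rate, rng):
    """Zero each feature independently with probability `rate` (assumed masking noise)."""
    return [0.0 if rng.random() < rate else value for value in x]


if __name__ == "__main__":
    shapes = layer_shapes(64)  # 64 input features is a hypothetical value
    print(shapes[3])           # the layer that produces the 10-dimensional embedding
    print(mask_corrupt([1.0, 2.0, 3.0, 4.0], rate=0.5, rng=random.Random(0)))
```

In a denoising setup of this kind, the network receives the corrupted vector as input while the reconstruction target remains the clean sample.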
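The objective function optimized in this stage is the reconstruction loss function $L_r$ over the whole set of samples $X$:
$$L_r(X) = \frac{1}{N} \sum_{n=1}^{N} \left\lVert x_n - \mathrm{dec}_d(\mathrm{enc}_e(x_n)) \right\rVert_2^2 \tag{1}$$
where $N$ is the number of labelled and unlabelled samples, and $\mathrm{enc}_e(\cdot)$ (respectively $\mathrm{dec}_d(\cdot)$) is the encoder (respectively decoder) network with parameters $e$ (respectively $d$).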
Here and in the following $\lVert \cdot \rVert_2$ denotes the $L_2$ norm.
Stage 2 of the framework refines the autoencoder by alternating a gradient descent on the $L_r$ loss with a gradient descent on a metric learning loss, in which the label information associated with the positive samples is leveraged to learn a label-aware projection of the original data. The metric loss $L_m$ is defined as follows:
$$L_m(X_l, Y_l) = \sum_{n=1}^{N_l} \sum_{n'=1}^{N_l} \left( \mathbb{1}_{[y_n = y_{n'}]} \left[ d_{nn'}^2 - \beta_s \right]_+ + \left( 1 - \mathbb{1}_{[y_n = y_{n'}]} \right) \left[ \beta_d - d_{nn'}^2 \right]_+ \right) \tag{2}$$
where $\mathbb{1}_{[y_n = y_{n'}]}$ is an indicator function that returns 1 if $x_n$ and $x_{n'}$ belong to the same class and 0 otherwise. $d_{nn'}^2$ is the squared Euclidean distance between the embeddings of $x_n$ and $x_{n'}$, while $[\cdot]_+$ is the ramp function defined as $[z]_+ = \max(0, z)$, commonly used in the hinge loss or in ReLU units. $\beta_s$ and $\beta_d$ are two margin parameters that allow one to adjust the contribution of the two components of the $L_m$ loss function. The loss induces a penalty when the squared distance $d_{nn'}^2$ between two samples of the same class ($\mathbb{1}_{[y_n = y_{n'}]} = 1$) is larger than $\beta_s$, since in this case $[d_{nn'}^2 - \beta_s]_+$ would be positive. Similarly, it induces a penalty when the squared distance of samples of different classes is smaller than $\beta_d$. Thus, the smaller $\beta_s$ is, the closer two samples of the same class are required to be to avoid incurring a penalty; while the bigger $\beta_d$ is, the more distant two samples of different classes are required to be. The goal of the $L_m$ loss function is to exploit the available label information $Y_l$ to stretch the geometric manifold induced by the embedded representation with the aim of integrating the class information. In so doing it forces samples belonging to the same class to be close together (minimizing the term $[d_{nn'}^2 - \beta_s]_+$) and samples belonging to different classes to be far away from each other (minimizing the term $[\beta_d - d_{nn'}^2]_+$). We note that the $L_m$ loss can only be applied on the set of labelled samples $X_l$, which explains the need to optimize $L_m$ and $L_r$ separately.
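The two hinge terms of the metric loss can be made concrete with a small sketch. The function below is a plain-Python illustration, not the authors' implementation: it assumes the embeddings have already been computed by the encoder, sums the penalties over unordered pairs of labelled samples, and applies no normalization. The function names are hypothetical.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def metric_loss(embeddings, labels, beta_s, beta_d):
    """Pairwise hinge penalties over all unordered pairs of labelled samples.

    Same-class pairs are penalized when their squared distance exceeds beta_s;
    different-class pairs are penalized when it falls below beta_d.
    """
    total = 0.0
    n = len(embeddings)
    for i in range(n):
        for j in range(i + 1, n):
            d2 = squared_distance(embeddings[i], embeddings[j])
            if labels[i] == labels[j]:
                total += max(0.0, d2 - beta_s)
            else:
                total += max(0.0, beta_d - d2)
    return total


if __name__ == "__main__":
    z = [[0.0, 0.0], [1.0, 0.0], [3.0, 0.0]]  # toy embeddings
    y = [1, 1, 2]                             # toy positive labels
    print(metric_loss(z, y, beta_s=0.5, beta_d=5.0))
```

In the toy call, the same-class pair is slightly too far apart and one of the different-class pairs is slightly too close, so both kinds of penalty contribute to the total.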
Stage 3 starts with the initialization of the model prototypes with the centroids derived by a clustering step on the current embedded representation of $X$. In principle any distance-based clustering algorithm could be used; in our experiments we adopted the well-known K-means clustering algorithm [28], setting the number of clusters equal to $K$. The first $K-1$ centroids are initialized in the center of mass of the embeddings of samples belonging to the $K-1$ positive classes. The last centroid is initialized by randomly selecting a sample with a probability proportional to its distance from the closest centroid.
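A minimal sketch of this centroid initialization is given below, under the assumption that the embeddings are available as plain Python lists and that the plain (not squared) Euclidean distance is used for the sampling weights. The function names are hypothetical, and the subsequent K-means refinement is not shown.

```python
import math
import random


def positive_centroids(labelled_embeddings, labels, num_positive):
    """Center of mass of the embeddings of each positive class 1..num_positive."""
    dim = len(labelled_embeddings[0])
    centroids = []
    for c in range(1, num_positive + 1):
        members = [z for z, y in zip(labelled_embeddings, labels) if y == c]
        centroids.append([sum(z[i] for z in members) / len(members) for i in range(dim)])
    return centroids


def sample_negative_seed(all_embeddings, centroids, rng):
    """Pick one embedding with probability proportional to its distance
    from the closest positive centroid."""
    weights = [min(math.dist(z, c) for c in centroids) for z in all_embeddings]
    return rng.choices(all_embeddings, weights=weights, k=1)[0]


if __name__ == "__main__":
    z_l = [[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [12.0, 0.0]]  # toy labelled embeddings
    cents = positive_centroids(z_l, [1, 1, 2, 2], num_positive=2)
    seed = sample_negative_seed(z_l + [[6.0, 50.0]], cents, random.Random(0))
    print(cents, seed)
```

Samples far from every positive centroid receive the largest weights, which makes them the most likely starting point for the prototype of the negative class.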
After the initialization of the prototypes, the main loop in Stage 3 relies on an alternate optimization strategy where $\mathrm{enc}_e$, $\mathrm{dec}_d$, and the deep soft-clustering networks are tuned using $X$ by descending the gradient $\nabla_{e,d,p} \left( L_r(X) + \gamma_c L_c(X) \right)$, and $\mathrm{enc}_e$ is further refined using the labelled samples by descending the gradient $\nabla_{e} \gamma_m L_m(X_l, Y_l)$.
More specifically, given the initial prototypes, in this stage we exploit the deep clustering strategy pioneered in [25] as a way to further improve the data partitioning. The deep clustering approach starts by computing a soft assignment $Q$ between the embeddings and the prototypes; then, based on these assignments, the embedded data representation as well as the prototypes are updated by descending the gradients of the clustering loss $L_c$. The loss is based on the Kullback-Leibler (KL) divergence between the distribution representing the soft assignments of samples to the prototypes and an auxiliary target distribution $B$:
$$L_c(X) = \mathrm{KL}(B \,\|\, Q) = \sum_{n=1}^{N} \sum_{k=1}^{K} b_{nk} \log \frac{b_{nk}}{q_{nk}} \tag{3}$$
The auxiliary target distribution $B$ is computed every $T$ iterations using the formula:
$$b_{nk} = \frac{q_{nk}^2 / q_k}{\sum_{k'=1}^{K} q_{nk'}^2 / q_{k'}} \tag{4}$$
where $q_k = \sum_{n=1}^{N} q_{nk}$ is used as a normalization factor so as to avoid preferring bigger clusters. As emphasized in [25], distribution $B$ is defined in terms of $Q$, implying that the minimization of $L_c(X)$ is a form of self-learning. In fact, in a self-learning setting, an initial classifier is used to label an unlabelled dataset in order to train itself on its own high-confidence predictions. In our case the distribution $B$ plays the role of the high-confidence predictions and, by minimizing the Kullback-Leibler (KL) divergence between $B$ and $Q$, one gets the information needed to further update the embeddings to improve the partitioning of the samples.
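To make the self-learning step more tangible, the sketch below computes an auxiliary target distribution from a given soft-assignment matrix by squaring the assignments, dividing by the per-cluster totals and renormalizing each row, as in the deep clustering approach of [25], and then evaluates the KL divergence between the two. It is an illustrative pure-Python version with hypothetical function names; the soft assignments are taken as given.

```python
import math


def target_distribution(q):
    """Sharpen soft assignments: square them, divide by the per-cluster
    totals, then renormalize each row so that it sums to one."""
    n_clusters = len(q[0])
    cluster_totals = [sum(row[k] for row in q) for k in range(n_clusters)]
    b = []
    for row in q:
        weights = [row[k] ** 2 / cluster_totals[k] for k in range(n_clusters)]
        norm = sum(weights)
        b.append([w / norm for w in weights])
    return b


def kl_clustering_loss(b, q):
    """KL divergence between target distribution b and soft assignments q,
    summed over samples and clusters (terms with b_nk = 0 contribute zero)."""
    return sum(
        b_nk * math.log(b_nk / q_nk)
        for b_row, q_row in zip(b, q)
        for b_nk, q_nk in zip(b_row, q_row)
        if b_nk > 0.0
    )


if __name__ == "__main__":
    q = [[0.8, 0.2], [0.6, 0.4], [0.3, 0.7]]  # toy soft assignments
    b = target_distribution(q)
    print(b, kl_clustering_loss(b, q))
```

In the toy example each row of `b` puts more mass on its dominant cluster than the corresponding row of `q`, which is the sharpening effect that drives the self-learning update.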
To compute the soft assignment, following [25], [29], we exploit the Student's t-distribution as a kernel to measure the similarity between points in the embedded space and the prototypes [30]:
$$q_{nk} = \frac{\left( 1 + \lVert \mathrm{enc}_e(x_n) - p_k \rVert_2^2 \right)^{-1}}{\sum_{k'=1}^{K} \left( 1 + \lVert \mathrm{enc}_e(x_n) - p_{k'} \rVert_2^2 \right)^{-1}} \tag{5}$$
where $\mathrm{enc}_e(x_n)$ is the embedded representation of the $n$-th sample, $p_k$ (respectively $p_{k'}$) is the $k$-th (respectively $k'$-th) prototype, and $q_{nk}$ is the soft assignment between sample $x_n$ and prototype $p_k$. The auxiliary target distribution $B$, in turn, forces the assignments to have sharper probabilities (closer to 0 or 1) by squaring the original distribution and then normalizing it [31].
VOLUME 10, 2022
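The soft assignment can be sketched as follows, assuming a Student's t kernel with one degree of freedom as in the deep clustering approach of [25] and embeddings that have already been produced by the encoder. The function name is hypothetical and the code is an illustration rather than the authors' implementation.

```python
def soft_assignments(embeddings, prototypes):
    """Row-normalized Student's t similarities (one degree of freedom)
    between each embedding and each prototype."""
    q = []
    for z in embeddings:
        kernel = [
            1.0 / (1.0 + sum((a - b) ** 2 for a, b in zip(z, p)))
            for p in prototypes
        ]
        total = sum(kernel)
        q.append([value / total for value in kernel])
    return q


if __name__ == "__main__":
    z = [[0.0, 0.0], [1.0, 0.0]]           # toy embeddings
    prototypes = [[0.0, 0.0], [2.0, 0.0]]  # toy prototypes
    print(soft_assignments(z, prototypes))
```

An embedding that coincides with a prototype receives most of its mass from that prototype, while an embedding equidistant from two prototypes is split evenly between them.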
To weight the contribution of metric learning and deep clustering, we multiply the metric loss $L_m$ and the clustering loss $L_c$ by the user-tunable hyper-parameters $\gamma_m$ and $\gamma_c$, respectively.
The algorithm terminates providing the learnt prototypes $p$. The prototypes, along with the learnt encoder $\mathrm{enc}_e$, allow classifying new samples by first mapping them into the embedded space and then assigning them the class of the nearest prototype:
$$\hat{y}(x) = \arg\min_{k \in \{1, \dots, K\}} \left\lVert \mathrm{enc}_e(x) - p_k \right\rVert_2 \tag{6}$$
The time complexity of the framework is given by the sum of the complexities of the three stages it is built on. We observe that the time complexity of stages two and three is dominated by the complexity of the gradient descent (backpropagation) algorithm, which is $O(YWNE)$, where $Y$ is the number of layers in the network, $W = O(R^2)$ is the number of weights per layer, $R$ is the maximum number of neurons per layer, $N$ is the number of examples and $E$ is the number of epochs. Stage one is more costly because it involves the training procedure for the greedy layer-wise pretraining strategy, which repeats the training $Y$ times, yielding a total complexity of $O(Y^2 WNE)$. In summary, the total complexity is just the sum of the three given complexities, which is dominated by the $O(Y^2 WNE)$ term.
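The nearest-prototype rule used at inference time can be sketched in a few lines. The example assumes that the test sample has already been mapped into the embedded space by the trained encoder, and that prototypes are ordered so that indices 1 to K − 1 correspond to the positive classes and index K to the negative one; the function name is hypothetical.

```python
def nearest_prototype_class(embedding, prototypes):
    """Return the 1-based index of the prototype closest to `embedding`."""
    squared = [
        sum((a - b) ** 2 for a, b in zip(embedding, p))
        for p in prototypes
    ]
    return 1 + min(range(len(prototypes)), key=squared.__getitem__)


if __name__ == "__main__":
    prototypes = [[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]]  # toy prototypes, K = 3
    print(nearest_prototype_class([9.0, 1.0], prototypes))
```

Comparing squared distances is sufficient here, since squaring does not change which prototype is closest.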

V. EXPERIMENTAL EVALUATION
In this section, we introduce the experimental settings and datasets we have adopted to evaluate the proposed framework, as well as the results and the related discussion.
We provide several quantitative evaluations. In the first one, we provide an ablation study about the different components on which ProtoMPUL is built. In the second evaluation, we compare the proposed approach w.r.t recent competitors considering a setting similar to the one reported in [12]. In the third and fourth evaluations, we assess the sensitivity of the different competing approaches to the variation of the number of positive classes as well as to the variation of the number of labelled samples. Finally, we summarize and discuss information about the execution time of the different methodologies.

A. EXPERIMENTAL SETTINGS AND DATASETS
We consider recent state-of-the-art methods as well as reference methods:
• our main competitors are two recent methods proposed in [13], named AREA (Alternative Risk EstimAtor) and UREA (Unbiased Risk EstimAtor). Both methods are based on the concept of an empirical risk estimator. While the former considers an (unbounded) estimator that can suffer from overfitting, the latter solves such a problem, allowing a better generalization on unseen data;
• similarly to what was done in [12], we consider a linear Support Vector Machine [32] (named Linear SVM) learnt on the original multi-class classification problem. This is not a competitor for our approach. Rather, we keep it as a reference method since labels for all the classes (including the negative one) are available at the training stage;
• in addition to what has been done in previous studies, since deep learning approaches are nonlinear methods, we also consider a radial basis function Support Vector Machine [32] (named RBF SVM) as an additional reference method trained on the fully labelled data set.
It is worth stressing that, while the first two approaches (UREA and AREA) are direct MPU learning competitors (using exactly the same amount of label information as well as the same learning setting as ProtoMPUL), the two SVM methods are deployed in the standard supervised setting: all samples in the training set are labelled and they cover the whole set of K classes. Their performances should then be taken as an upper bound of the possible performances an MPUL approach might achieve.
We evaluate the performances of the different approaches on nine standard multi-class classification tasks. The characteristics of the datasets are reported in Table 1. Given a dataset, for each positive class, we consider a 60/20/20 (train/validation/test) split of the associated samples. Among the training samples of a specific class, only half of them (i.e., 30% of the class samples) are associated with label information, while the rest are assigned to the unlabelled training set. The rest of the class samples belong to the validation and to the test set. Regarding the negative class, half of the samples are assigned to the unlabelled training set, while the rest are assigned to the test set. We note that at training time one does not know the identity of the negative class, hence hyper-parameters are only estimated using the positive classes and the validation set cannot contain negative examples. As usual, the validation set is used to choose the best hyper-parameter settings for the different competing approaches, while the test set is used to assess the ability of the different methods to generalize on unseen samples.
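The splitting protocol described above can be sketched as follows. This is an illustrative reading of the protocol, not the code used in the experiments: the function name, the dictionary-based input and the use of integer arithmetic for the 30/30/20/20 proportions are assumptions.

```python
import random


def mpul_split(samples_by_class, negative_class, rng):
    """Split per-class samples into labelled training, unlabelled training,
    validation and test sets following the protocol described in the text."""
    labelled, unlabelled, validation, test = [], [], [], []
    for label, samples in samples_by_class.items():
        items = list(samples)
        rng.shuffle(items)
        n = len(items)
        if label == negative_class:
            # Negative class: half unlabelled training data, half test data.
            half = n // 2
            unlabelled.extend(items[:half])
            test.extend((x, label) for x in items[half:])
            continue
        # Positive class: 60/20/20 split, with half of the training part labelled.
        n_lab, n_train, n_val_end = (3 * n) // 10, (6 * n) // 10, (8 * n) // 10
        labelled.extend((x, label) for x in items[:n_lab])
        unlabelled.extend(items[n_lab:n_train])
        validation.extend((x, label) for x in items[n_train:n_val_end])
        test.extend((x, label) for x in items[n_val_end:])
    return labelled, unlabelled, validation, test


if __name__ == "__main__":
    # Toy data: integers stand in for samples; class 0 plays the negative class.
    data = {1: range(0, 10), 2: range(100, 110), 0: range(200, 210)}
    lab, unl, val, tst = mpul_split(data, negative_class=0, rng=random.Random(0))
    print(len(lab), len(unl), len(val), len(tst))
```

Note that the validation set produced this way contains positive samples only, in line with the observation that the negative class is unknown at training time.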
As the evaluation metric, we choose the F1 score [33] since, even if not perfect [34], it is very popular for evaluating predictive performance in class-imbalanced scenarios.
To avoid possible bias due to the way in which datasets are split, we repeat the process described above 5 times and we average the obtained results.
Except for the SVM methods, all the other approaches (including ProtoMPUL) are learnt via stochastic gradient descent through the Adam optimizer [35] with a batch size equal to 256.
For each dataset and each method, we used a grid search to find the best hyper-parameter configuration using a validation set based on the positive samples only. Experiments are carried out on a workstation equipped with an Intel(R) Xeon(R) CPU E5-2643 @ 3.30GHz, with 128 GB of RAM. No Graphics Processing Unit was employed during the experiments.

B. ABLATION STUDY
The first experiment we have conducted has the objective to validate the importance of the components of ProtoMPUL.
To this end, we compare the performances of ProtoMPUL to several of its ablations.
To choose the ablation settings, we considered only configurations that involve coherent subsets of the components that our framework adopts.
The summary of the different ProtoMPUL ablations is reported in Table 2. The first ablation (Abla 0) evaluates the appropriateness of the layer-wise training procedure associated with the first stage of ProtoMPUL. Next, Abla 1 and Abla 2 are specially tailored to evaluate the importance of the set of loss functions employed in the second stage of our framework, while the remaining ablations (Abla 3, Abla 4 and Abla 5) assess the interplay of the full set of loss functions in the third stage of our proposal.
Table 3 reports the results of the ablation study. We can observe that, generally, ProtoMPUL outperforms all its different ablations or attains comparable performances. The only case in which a different trend is exhibited is related to the Sonar dataset. This is probably because this benchmark is the smallest one we have in terms of samples (around 200) and this factor negatively influences the ProtoMPUL training procedure. It is also worth mentioning that, as we show in Section V-C, all MPUL methods fail on this dataset, which might indicate a general problem with this dataset that also affects the ablation study.
We can note that the layer-wise strategy is worthwhile, since Abla 0 is dominated by ProtoMPUL most of the time. For the rest of the ablations, the complete system is usually better, often by a large margin. This trend is violated only by Abla 5 on the OptDigits dataset, but the same ablation setting is much worse than the complete system on all the other datasets. Based on this evidence, we conclude that all the components of ProtoMPUL are important and that their interplay contributes to the state-of-the-art results that we show in the next section.

C. EVALUATION OF COMPETING APPROACHES
Table 4 summarizes the results, in terms of F1 score, obtained by the different competing approaches on the set of benchmarks introduced in Section V. We can observe that ProtoMPUL outperforms the direct competing methods (AREA and UREA) on the majority of the datasets. When this does not happen (FMnist), performances are still largely comparable. Regarding the two SVM models, we note that RBF SVM achieves better performances than Linear SVM on almost all the datasets. This indicates that the former represents a more robust and effective upper bound for the MPUL approaches. It is also interesting to note that ProtoMPUL outperforms Linear SVM on some datasets (Landsat, Mnist and OptDigits) even though Linear SVM has complete knowledge of the involved set of classes. When compared to the other method with complete knowledge of the classes (RBF SVM), ProtoMPUL achieves results that are not far from it, thus demonstrating the quality of the proposed framework.
We assess the statistical significance of the obtained results using two statistical tests. We first compute a Friedman test [36] to assess whether the differences in the observed F1 scores are statistically significant; we then set up a one-tailed independent Student's t-test to assess whether the observed average F1 scores support the hypothesis µ1 > µ2, where µ1 is the performance of ProtoMPUL and µ2 is the performance of a direct competitor (AREA or UREA). In all cases, we consider the test passed when it supports the alternative hypothesis at the 0.05 significance level.
The Friedman test comfortably rejects the null hypothesis (that the F1 scores would be the same regardless of the algorithm) at the 0.05 significance level, since this non-parametric test yields a p-value of 0.00178. Table 5 summarizes the results of the independent Student's t-test. Green values indicate results that are significant at the 0.05 level and red values indicate results that are not statistically significant. We can observe that the differences in the observed mean values are always statistically significant (at the given level) when ProtoMPUL is compared with UREA. In the case of AREA, the difference in the observed means is not significant only 2 times out of 9. It is worth pointing out that: i) the first of these two cases (FMnist dataset, comparison with AREA) corresponds to the one result where ProtoMPUL is not better than the competitor, and the outcome is actually a positive one for us, since the reverse hypothesis µ2 > µ1 is also not supported at the 0.05 significance level; ii) the second case (Sonar dataset) is one where all MPUL methods fail to learn anything useful.
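To make the two tests concrete, the following is a minimal, standard-library sketch of a Friedman test and of a one-tailed, pooled-variance Student's t-test. It is not the code used for the experiments: the function names are hypothetical, no tie correction is applied to the Friedman statistic, and the scores are made-up placeholders rather than values from Tables 4 or 5. In practice a statistics library would normally be used instead (for example, SciPy provides `scipy.stats.friedmanchisquare` and `scipy.stats.ttest_ind`).

```python
import math


def average_ranks(values):
    """Rank values in ascending order (1 = smallest); ties share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for i in order[start:end + 1]:
            ranks[i] = (start + end) / 2 + 1
        start = end + 1
    return ranks


def chi2_sf(x, dof):
    """P(X > x) for a chi-square variable with integer degrees of freedom."""
    if x <= 0:
        return 1.0
    half = x / 2
    if dof % 2 == 0:
        term = total = 1.0
        for i in range(1, dof // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    total = math.erfc(math.sqrt(half))
    term = math.sqrt(half) / math.gamma(1.5)
    for i in range(1, (dof - 1) // 2 + 1):
        total += math.exp(-half) * term
        term *= half / (i + 0.5)
    return total


def friedman_test(scores):
    """scores[i][j] = score of algorithm j on dataset i. Returns (statistic, p_value).

    No tie correction is applied (an assumption made to keep the sketch short).
    """
    n, k = len(scores), len(scores[0])
    rank_sums = [0.0] * k
    for row in scores:
        for j, r in enumerate(average_ranks(row)):
            rank_sums[j] += r
    stat = 12 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3 * n * (k + 1)
    return stat, chi2_sf(stat, k - 1)


def student_t_sf(t, dof, steps=2000):
    """P(T > t) for Student's t, by Simpson integration of the density on [0, |t|]."""
    coef = math.gamma((dof + 1) / 2) / (math.sqrt(dof * math.pi) * math.gamma(dof / 2))

    def pdf(u):
        return coef * (1 + u * u / dof) ** (-(dof + 1) / 2)

    a = abs(t)
    h = a / steps  # steps must be even for Simpson's rule
    area = pdf(0.0) + pdf(a)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * pdf(i * h)
    upper = 0.5 - area * h / 3
    return upper if t >= 0 else 1 - upper


def one_tailed_ttest(a, b):
    """Pooled-variance t-test of H1: mean(a) > mean(b). Returns (t, p_value)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    ss = sum((x - ma) ** 2 for x in a) + sum((x - mb) ** 2 for x in b)
    dof = na + nb - 2
    t = (ma - mb) / math.sqrt(ss / dof * (1 / na + 1 / nb))
    return t, student_t_sf(t, dof)


# Made-up F1 scores (NOT from the paper): rows are datasets, columns are algorithms.
toy_scores = [[0.90, 0.80, 0.70], [0.85, 0.75, 0.60],
              [0.95, 0.90, 0.80], [0.70, 0.65, 0.50]]
stat, p = friedman_test(toy_scores)
print(f"Friedman statistic = {stat:.3f}, p-value = {p:.4f}")

# Made-up F1 scores over repeated runs of two methods on a single dataset.
t, p = one_tailed_ttest([0.90, 0.80, 0.85], [0.70, 0.75, 0.65])
print(f"t = {t:.3f}, one-tailed p-value = {p:.4f}")
```

A test is considered passed in the sense used above when the returned p-value is below 0.05.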

D. SENSITIVITY TO THE NUMBER OF POSITIVE CLASSES
In this experiment, we evaluate the sensitivity of the approaches to the number of positive classes. To this end, we chose two datasets among those having the maximal number of labels (Pendigits and Semeion) and evaluated how varying the number of positive classes impacts classification performance. The number of positive classes was varied over the set {2, 4, 6, 8, 9} (9 being the largest possible value given the labels in the unmodified dataset). Results are reported in Fig. 3.
These results show that ProtoMPUL achieves superior performances with respect to the competitors (AREA and UREA) on both benchmarks and in almost all cases. When this is not the case, i.e., for the Pendigits dataset with 4 and 6 positive classes, the results are still largely comparable. More generally, we observe that the performances of all the MPUL methods decrease as the number of positive classes increases, but the decrease appears to be less severe in the case of ProtoMPUL.
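As an illustration of how such a setting could be built from a fully labelled multi-class dataset, the sketch below keeps a chosen set of classes as positives and merges every other class into a single negative class. This is only one plausible construction: the merging rule, the `NEGATIVE` id and the function name are assumptions made for this example, and the protocol actually used in the experiments may differ.

```python
NEGATIVE = -1  # hypothetical id for the single (merged) negative class


def merge_negatives(labels, positive_classes):
    """Keep labels of the positive classes; map every other label to NEGATIVE."""
    positives = set(positive_classes)
    return [y if y in positives else NEGATIVE for y in labels]


# Toy stand-in for a 10-class digits dataset (three samples per class).
labels = list(range(10)) * 3

for k in (2, 4, 6, 8, 9):
    # Assumption: the first k classes are used as the positive ones.
    targets = merge_negatives(labels, range(k))
    print(f"{k} positive classes -> {targets.count(NEGATIVE)} negative samples")
```

With 9 positive classes a single original class is left to play the role of the negative one, which is why 9 is the largest admissible value for a 10-class dataset.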

E. SENSITIVITY TO THE AMOUNT OF LABELLED SAMPLES
In this experiment, we evaluate the sensitivity of the algorithms to the number of labelled samples on Pendigits and Semeion. We chose these datasets based on the following factors: i) they are the same two datasets used in the previous experiment, which simplifies the experimentation, allows for easier reporting, and makes for a fairer comparison since the choice is kept fixed; ii) both datasets are hard ones: as mentioned in the previous section, the performances of all MPUL approaches decrease as the number of positive classes grows, and these datasets are among the ones with the largest number of classes; iii) one is a large dataset and the other a small one, allowing us to assess how the algorithms work in the two regimes.
In the experiment, we varied the amount of labelled samples for the positive classes. Specifically, we let the percentage of positively labelled examples (with respect to the positive examples in the experiment reported in Section V-C) vary over the set {20%, 40%, 60%, 80%, 100%}. Results are provided in Fig. 4.
We can observe that ProtoMPUL outperforms the competitors (AREA and UREA) regardless of the percentage of labelled samples from the positive classes. In addition, it exhibits a more stable behavior than the competitors.
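A reduction of the labelled positive examples can be sketched as follows. The per-class (stratified) subsampling, the choice of moving the released examples to the unlabelled pool, and all names are assumptions made for illustration; they are not taken from the experimental code.

```python
import random


def subsample_labelled(labelled_by_class, fraction, seed=0):
    """Keep `fraction` of the labelled indices of each positive class.

    Returns (kept, released): `kept` maps each class to the indices that stay
    labelled, `released` lists the indices that lose their label.
    """
    rng = random.Random(seed)
    kept, released = {}, []
    for cls, indices in labelled_by_class.items():
        shuffled = list(indices)
        rng.shuffle(shuffled)
        n_keep = max(1, round(fraction * len(shuffled)))  # keep at least one example
        kept[cls] = sorted(shuffled[:n_keep])
        released.extend(shuffled[n_keep:])
    return kept, sorted(released)


# Toy index sets for two hypothetical positive classes.
labelled = {0: list(range(10)), 1: list(range(10, 30))}

for fraction in (0.2, 0.4, 0.6, 0.8, 1.0):
    kept, released = subsample_labelled(labelled, fraction)
    sizes = {cls: len(idx) for cls, idx in kept.items()}
    print(f"{int(fraction * 100)}% -> kept per class {sizes}, released {len(released)}")
```

Fixing the seed keeps the subsets reproducible across the compared methods, so that every method sees the same labelled examples at each percentage.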

F. COMPUTATIONAL COSTS
In this section, we provide an overview of the computational cost of training the tested approaches. The training time (in seconds) is reported in Table 6. We do not provide figures for the inference time because, as usual for neural networks, it is negligible when compared to the training time. For ProtoMPUL specifically, the total inference cost is given by the cost of encoding the example using enc e plus the (negligible) cost of comparing the result with the set of learned prototypes. All methods were run on CPU, as already mentioned in Section V-A. ProtoMPUL is clearly the algorithm with the highest computational requirements, but it is worth noting that it is also the one with the smallest variability between experiments. This is because the time complexity of ProtoMPUL is dominated by the layer-wise pretraining performed in the first phase, which is largely determined by the number of layers, a parameter that is kept fixed in our experiments. In our opinion, while the time performances are clearly not favorable to ProtoMPUL, they are still within reason and can be largely justified by the better performances of the learnt model.
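The inference step described above (one encoder pass followed by a comparison with the prototypes) can be sketched as a nearest-prototype rule. The Euclidean distance, the toy identity encoder, the prototype coordinates and all names are assumptions made for this example; the actual encoder is the trained network and the actual comparison may use a different measure.

```python
import math


def nearest_prototype(embedding, prototypes):
    """Return the label of the prototype closest (Euclidean distance) to the embedding."""
    return min(prototypes, key=lambda label: math.dist(embedding, prototypes[label]))


def classify(x, encoder, prototypes):
    """Encode the sample once, then compare it with each learned prototype."""
    return nearest_prototype(encoder(x), prototypes)


def toy_encoder(x):
    """Hypothetical stand-in for the trained encoder: the identity map."""
    return x


# Hypothetical prototypes: one per positive class plus one for the negative class.
prototypes = {"pos_0": (0.0, 0.0), "pos_1": (4.0, 0.0), "negative": (0.0, 4.0)}

print(classify((3.5, 0.5), toy_encoder, prototypes))
print(classify((0.2, 3.0), toy_encoder, prototypes))
```

Under these assumptions, the cost per test sample is one forward pass of the encoder plus one distance computation per prototype, which is consistent with the inference time being small relative to training.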

VI. CONCLUSION
In this work, we have presented a new framework for Multi-Positive and Unlabelled learning (MPUL) for the classification of propositional (or tabular) data. Our framework, ProtoMPUL, combines deep metric learning and deep clustering approaches with the goal of modelling the set of K − 1 positive classes for which label information is available while, simultaneously, providing a characterization of the unknown negative class. The training algorithm for ProtoMPUL is based on three stages. In the first one, an autoencoder is incrementally trained to extract a preliminary embedding of the data. Subsequently, the learning procedure is complemented by a metric loss function with the aim of exploiting the available label information. The last stage integrates a deep clustering process to further enforce class separability and extract a set of prototypes (one for each of the K − 1 positive classes plus an additional one for the negative class). At inference time, the encoder network and the learnt prototypes are employed to classify previously unseen test samples.
The experimental comparison with the state-of-the-art MPUL competitors on standard propositional datasets has demonstrated the quality of the proposed solution, while the in-depth ablation analysis has highlighted that all the different components of ProtoMPUL play an important role in its performances. The experiments with a varying number of positive classes have shown that the performances of all MPUL methods tend to decrease as the number of positive classes grows, but also that ProtoMPUL is remarkably robust to this issue. The same kind of observation can be made about the performances of the algorithms when the number of labelled examples decreases: again, ProtoMPUL has better overall performances and appears to be more robust than the competitors. All these benefits have to be weighed against a larger computational demand for training the model. Several research directions are possible for future work. Among them, we plan to extend our framework to work with other kinds of data (e.g., images, multi-variate time-series) by adapting the autoencoder network to the specificity of the input data. We also intend to extend the proposed methodology to scenarios in which multiple unknown negative classes can be present. Finally, to reduce the computational demands of the algorithm, we plan to optimize phase 1, either by replacing the layer-wise pretraining with a less computationally demanding alternative, or by changing the neural architecture to avoid the pretraining altogether.