Explainable Impact of Partial Supervision in Semi-Supervised Fuzzy Clustering

Controlling the impact of partial supervision on the outcomes of modeling is of utmost importance in semi-supervised fuzzy clustering. Semi-Supervised Fuzzy C-Means (SSFCMeans), a specific model we consider, uses a single hyperparameter called a scaling factor <inline-formula><tex-math notation="LaTeX">$\alpha$</tex-math></inline-formula> to weigh the impact of partially labeled data. This concept became widespread and was reused directly in many works building on SSFCMeans, or even applied to other fuzzy clustering algorithms, such as Possibilistic C-Means. However, none of these works challenged the original interpretation of <inline-formula><tex-math notation="LaTeX">$\alpha$</tex-math></inline-formula>, which suggests that the impact of partial supervision is directly proportional to the scaling factor. We fill this research gap and thoroughly analyze the relationship. We provide novel explanations of the scaling factor <inline-formula><tex-math notation="LaTeX">$\alpha$</tex-math></inline-formula> in terms of the key element of fuzzy clustering: the membership values. We prove that the impact of partial supervision is a nonlinear function of <inline-formula><tex-math notation="LaTeX">$\alpha$</tex-math></inline-formula>. Our approach is rooted in the explainability framework, which distinguishes an interpretation from an explanation and treats the latter as superior. Explaining the scaling factor leads to an explainable impact of partial supervision and enables greater control of it. Finally, building on the novel explanations, we propose a unified, analytically justified framework for selecting the value of the hyperparameter <inline-formula><tex-math notation="LaTeX">$\alpha$</tex-math></inline-formula> that is based on the cross-validation approach. With a simulation experiment, we illustrate that the proposed framework enables an extensive analysis of the impact of partial supervision in SSFCMeans.

M observations out of all N observations (M < N) is obtained. This additional information is hence called partial supervision.
In our scenario, it is given in the form of a label $y \in \{y_1, \ldots, y_c\}$ denoting the class to which an observation belongs.
The class of semi-supervised fuzzy clustering (SSFC) models adapted to handle this type of partial supervision that we consider 1) is based on the partitioning approach, where the number of clusters $c \geq 2$ is fixed, and 2) defines similarity as a distance between observations and clusters' prototypes measured by a metric $d$. These models are thus referred to as distance-based SSFC models in the literature [2]. The fundamental design choice of any SSFC model is how to manage the impact of partial supervision on the results of clustering: the estimated degrees of membership and the clusters' prototypes. One technique for controlling the impact of partial supervision, which we call the additive combination, was introduced in [3]. It relies on a special construction of the associated objective function that combines two components in an additive manner: the unsupervised one and the supervised one. Pedrycz and Waletzky [3] proposed the additive combination as an element of the Semi-Supervised Fuzzy C-Means (SSFCMeans) model, an adaptation of the well-known unsupervised Fuzzy C-Means (FCM) described in [4].
Works in [5], [6], [7], [8], [9], [10], [11], [12], [13] extended SSFCMeans in different ways and modified the mechanism of handling partial supervision to various extents, but changed neither the core idea of the additive combination nor its interpretation. Works in [14], [15], [16], [17] wrapped SSFCMeans to analyze data streams, primarily in the problem of monitoring bipolar disorder. Works in [18], [19], [20], [21], [22] explored safe semi-supervised clustering aimed at handling mislabeled instances (label errors). Kmita et al. [23] developed a procedure to estimate the uncertainty of labels resulting from an indirect annotation process. Last but not least, the very idea of the additive combination was applied to unsupervised fuzzy clustering models alternative to FCM. These include Possibilistic C-Means (PCM) proposed in [24], and a mixture of FCM and PCM called Possibilistic Fuzzy C-Means (PFCM) [25]. The core unsupervised models, FCM and PCM, differ in the implementation and interpretation of the soft assignment mechanism. PCM, just like FCM, was studied and modified by many researchers, including [26], who proposed repulsive PCM. A semi-supervised version of repulsive PCM was proposed in [27], and a semi-supervised adaptation of PFCM was described in [28].
All the aforementioned SSFC models share the same way of controlling the impact of partial supervision, formulated originally in [3], although it may not be phrased directly (as different naming conventions are used). This impact is controlled with a single hyperparameter of the algorithm that, after [3], we call a scaling factor and denote by α. Pedrycz and Waletzky described the role of this hyperparameter as follows: the role of α "(...) is to maintain a balance between the supervised and unsupervised component within the optimization mechanism" [3, p. 789]. They neither quantified the impact nor discussed the relationship between the value of α and the key outcome of the SSFCMeans algorithm: the degrees of membership. None of the works that reused the additive combination technique in the sense of Pedrycz and Waletzky [3] explained this relationship either.
The main contribution of this work is to comprehensively explain the role of the scaling factor α in SSFC, because the existing descriptions can be treated only as interpretations of it. The distinction between these two terms is receiving close attention in statistical learning [29], [30], [31], with an explanation perceived as superior to an interpretation. We also postulate that the impact of partial supervision be quantified unambiguously as a function of α, denoted IPS(α).
Explainable models are especially important in healthcare data modeling, and are often referred to as eXplainable AI; such models enable the comprehension of the reasoning underlying the predictions that they produce. The motivation for this work arose also from the previous work of the authors [23], where a procedure called confidence path regularization (CPR) was proposed. This procedure wrapped the SSFCMeans model to estimate label uncertainty in the semi-supervised problem of monitoring the health status of patients diagnosed with bipolar disorder. The scaling factor α is of key importance for this procedure, and considerations on the topic of CPR led to the conclusion that existing descriptions of α were not sufficient to improve the whole procedure.
In this article, we fill the identified research gap and explain the impact of partial supervision in two core SSFC models. First, we study the aforementioned SSFCMeans model. The second model we investigate is called Semi-Supervised Possibilistic C-Means (SSPCMeans). We create it by applying the additive combination technique to introduce partial supervision to the classical PCM. SSFCMeans and SSPCMeans differ in the implementation of the soft assignment; hence, the explanation of the scaling factor differs as well. Our explanations apply to any model extending either SSFCMeans or SSPCMeans.
The rest of this article is organized as follows. In Section II, we discuss preliminaries of SSFC. We present the additive combination technique and the SSFCMeans and SSPCMeans models in detail. In Section III, we formalize the difference between an interpretation and an explanation and provide two novel explanations of the scaling factor α. Section IV focuses on the practical considerations stemming from the novel quantification of the impact of partial supervision. Finally, Section V concludes this article.

II. SEMI-SUPERVISED FUZZY CLUSTERING PRELIMINARIES
We now introduce basic definitions related to SSFC. Let $j$ denote any observation (unsupervised or supervised), $j = 1, \ldots, N$, and let $k$ denote a given cluster, $k = 1, \ldots, c$. In addition to these indices, partial supervision requires distinguishing between 1) supervised observations indexed by $i = 1, \ldots, M$ and 2) unsupervised observations indexed by $h = 1, \ldots, H$. A $j$th observation is represented by a $p$-dimensional feature vector $x_j \in \mathbb{R}^p$, and a $k$th cluster is represented by a $p$-dimensional vector $v_k \in \mathbb{R}^p$ called a prototype of the cluster. In the rest of this article, $d$ means the Euclidean distance.
The soft assignment of the $j$th observation to the $k$th cluster is usually expressed by a membership $u_{jk} \in [0, 1]$. This convention is used in FCM and all models building on it. However, PCM uses a typicality $t_{jk} \in [0, 1]$ convention to stress the fact that the interpretation of the soft assignment in PCM differs from the one used in FCM. For a cohesive presentation, we express a general concept of the soft assignment common to all SSFC models by memberships $u_{jk}$ when the specific details do not affect the overall reasoning.
The partial information itself is expressed in the form of a prior memberships matrix $F = [f_{jk}]$ of the same dimension as the memberships matrix $U$. Every cluster represented by a specific column of matrix $F$ must be arbitrarily associated with a single class. To this end, we create a $c$-tuple $Y = \langle y_1, \ldots, y_c \rangle$ out of the set $\{y_1, \ldots, y_c\}$ and associate the $k$th column in $F$ with the $k$th label $y_k$ from $Y$. We define $f_{jk}$ as binary entries such that $f_{jk} = 1$ only if the $j$th observation is known to belong to the $k$th cluster (associated with the $y_k$ label); otherwise $f_{jk}$ is equal to 0. An unsupervised observation $h$ has all prior memberships $f_{hk} = 0 \ \forall k$. Frequently, an auxiliary variable $b_j$ is used; $b_j = 1$ iff the $j$th observation is supervised. In our scenario, $b_j = \sum_{k=1}^{c} f_{jk}$, hence one could question whether this variable is indeed necessary. The choice of whether to use $b_j$ and how to include it in the model has subtle consequences, which we discuss when introducing the relevant models in the remainder of this section.
With partial supervision introduced in SSFC, a need arises for supervised observations to distinguish between their membership degrees to "unsupervised" and "supervised" clusters. By "supervised cluster" we mean "the cluster associated with the class $y_k$ that the observation is known to belong to." To retrieve this cluster, we define a function $s(i) \in \{1, \ldots, c\}$ that selects the index of the cluster associated with the $i$th supervised observation's class, i.e., $f_{ik} = 1$ iff $k = s(i)$. Further on, we discuss three distinct types of memberships as follows.
1) $u_{hk}$: membership of an unsupervised observation $h$ to any cluster $k$.
2) $u_{i,k \neq s(i)}$: membership of a supervised observation $i$ to any nonsupervised cluster.
3) $u_{i,s(i)}$: membership of a supervised observation $i$ to the supervised cluster $s(i)$.
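As an illustration of the construction above, the following minimal sketch builds $F$ and $b_j$ from a list of labels. The function name `prior_memberships` and the use of `None` to mark an unsupervised observation are assumptions made for this example, not part of the original formulation.

```python
import numpy as np

def prior_memberships(labels, classes):
    """Build the prior memberships matrix F (N x c) and the vector b (N,).

    `labels` holds one class label per observation; None (an assumption of
    this sketch) marks an unsupervised observation. `classes` is the ordered
    tuple Y, so column k of F is associated with classes[k].
    """
    column_of = {y: k for k, y in enumerate(classes)}
    F = np.zeros((len(labels), len(classes)))
    for j, y in enumerate(labels):
        if y is not None:
            F[j, column_of[y]] = 1.0  # f_jk = 1 only for the known class
    b = F.sum(axis=1)  # b_j = sum_k f_jk, i.e. 1 iff observation j is supervised
    return F, b

labels = ["y1", None, "y3", None]
F, b = prior_memberships(labels, classes=("y1", "y2", "y3"))
```

Because every supervised row of `F` contains exactly one entry equal to 1, `b` is fully determined by `F`, which is the redundancy the paragraph above points out.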

A. Additive Combination Technique
Since SSFC models modify the core unsupervised fuzzy clustering models, we introduce the additive combination technique by first considering a general form of an unsupervised fuzzy clustering model parameterized by hyperparameters gathered in $\Theta$. The optimization problem is formulated as

$$\arg\min_{U, V} \; Q(U, V; X, \Theta) \tag{1}$$

where $Q$ is the objective function, $U = [u_{jk}]_{N \times c}$ is a memberships matrix, $V = [v_k]_{c \times p}$ is a prototypes matrix, and $X = [x_j]_{N \times p}$ is a features matrix. SSFC models adapt the minimization problem from (1) by combining the unsupervised objective function $Q$ with its counterpart $Q_S$, which incorporates the partial supervision (hence the name additive combination), arriving at

$$\arg\min_{U, V} \; Q(U, V; X, \Theta) + \alpha \, Q_S(U, V; X, F, \Theta) \tag{2}$$

where $\alpha > 0$ is the scaling factor that controls the impact of partial supervision. The specific hyperparameters gathered in $\Theta$ differ between models, yet one hyperparameter is common to all of them. It is the "fuzzifier" $m > 1$ that controls the fuzziness of the soft assignments. Bezdek et al. [4, p. 70] described it as "the larger m is, the fuzzier are the membership assignments." In this article, we use the specific value $m = 2$. The justification is provided in [3, p. 789]: any value of $m \neq 2$ would result in a situation where the optimized variables are linked together in the form of a polynomial, and numerical procedures would be needed to find its roots.
In general, finding the optimal $(U, V)$ per (1) is intractable, and approximation algorithms are often used. A typical optimization procedure for fuzzy clustering is described in [32]. It relies on fixing one variable and optimizing the other at a time. Such an iterative procedure is performed until a convergence criterion is met. The formulae for the two variables $\hat{U}$ and $\hat{V}$ are obtained by studying first-order necessary conditions for a global minimizer $(\hat{U}, \hat{V})$ of the respective objective function. SSFC models can follow the same optimization procedure as long as the functions $J(U) = J(U; V, X, F, \Theta)$ and $J(V) = J(V; U, X, F, \Theta)$ remain convex. Indeed, this is the case for the functions $J_{SSFCM}$ and $J_{SSPCM}$ introduced in the rest of this section.
The two models we discuss, SSFCMeans and SSPCMeans, draw heavily from those introduced in [3] and [27], respectively. Our subtle yet important modifications are discussed in Sections II-B and II-C. Whenever we present original equations from the referenced articles, we adapt them to follow the nomenclature introduced in this section. We annotate all formulae from [3] with the subscript (or superscript) P97.

B. Semi-Supervised Fuzzy C-Means
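The objective function $J_{SSFCM}(U, V; X, F, \Theta)$ proposed in this article has the form

$$J_{SSFCM} = \sum_{j=1}^{N} \sum_{k=1}^{c} u_{jk}^2 d_{jk}^2 + \alpha \sum_{j=1}^{N} \sum_{k=1}^{c} b_j (u_{jk} - f_{jk})^2 d_{jk}^2 \tag{3}$$

where $d_{jk} = d(x_j, v_k)$ and the first component corresponds to the objective function $Q_{FCM}$ of the classical unsupervised FCM model described in [32]. The minimization problem to solve is thus

$$\arg\min_{U, V} \; J_{SSFCM}(U, V; X, F, \Theta) \tag{4a}$$
$$\text{s.t.} \quad \sum_{k=1}^{c} u_{jk} = 1 \quad \forall j \tag{4b}$$
$$u_{jk} \in [0, 1] \quad \forall j, k \tag{4c}$$
$$0 < \sum_{j=1}^{N} u_{jk} < N \quad \forall k \tag{4d}$$

where constraints (4b), (4c), (4d) are the same as in unsupervised FCM. In the following, we present the objective function $J_{P97}$ from [3, Eq. (2)]:

$$J_{P97} = \sum_{j=1}^{N} \sum_{k=1}^{c} u_{jk}^2 d_{jk}^2 + \alpha \sum_{j=1}^{N} \sum_{k=1}^{c} (u_{jk} - f_{jk} b_j)^2 d_{jk}^2. \tag{5}$$

As opposed to $J_{SSFCM}$ (3) proposed in this article, it was only $f_{jk}$ that was multiplied by $b_j$, not the entire expression $(u_{jk} - f_{jk})^2$. Pedrycz and Waletzky [3] presented in detail a solution to (4) w.r.t. $U$, but using $J_{P97}$, not $J_{SSFCM}$.
We applied the same analysis as presented in [3], but for the objective function $J_{SSFCM}$ (3) proposed in this article, obtaining the formula for the optimal membership

$$\hat{u}_{jk} = \frac{1}{1 + \alpha b_j} \left[ \left( 1 + \alpha b_j \Big( 1 - \sum_{g=1}^{c} f_{jg} \Big) \right) e_{jk} + \alpha b_j f_{jk} \right]. \tag{6}$$

We do not present the full derivation and refer the reader interested in details to [3]. An important part of (6) is

$$e_{jk} = \frac{1}{\sum_{g=1}^{c} d_{jk}^2 / d_{jg}^2} \tag{7}$$

which we call the data evidence. Note that the data evidence $e_{jk}$ is a function of the feature vector $x_j$, the prototypes, and the index of the cluster considered. Consequently, the membership $\hat{u}_{jk}$ (6) is also a function $u_{jk} = u(x_j, V, k, \alpha)$, but the notation $u_{jk}$ is used for brevity. Let us now apply the generic formula from (6) to the distinct types of the membership $\hat{u}_{hk}$, $\hat{u}_{i,k \neq s(i)}$, $\hat{u}_{i,s(i)}$. First, consider an unsupervised observation $h$. In such a case, $b_h = 0$ and $f_{hg} = 0 \ \forall g$. Then, (6) simplifies to

$$\hat{u}_{hk} = e_{hk}. \tag{8}$$

For the unsupervised observation, there is no direct impact of partial supervision on the value of the membership. It depends only on the data evidence, just as in FCM [32, p. 66].
Investigating the $i$th supervised observation and its memberships, we first consider a degree of membership to any nonsupervised cluster $k \neq s(i)$:

$$\hat{u}_{i,k \neq s(i)} = \frac{1}{1 + \alpha} \, e_{i,k \neq s(i)}. \tag{9}$$

The data evidence $e_{i,k \neq s(i)}$ is scaled down by the factor $\frac{1}{1+\alpha}$. It is a desired result of the partial supervision mechanism. Even if the data evidence were to support the belonging of the $i$th observation to the cluster $k \neq s(i)$, the additional information we possess would decrease this membership.
The aforementioned equations clarify the mechanism of SSFCMeans, but it is the membership of the supervised observation $i$ to the supervised cluster $s(i)$ that is of major interest:

$$\hat{u}_{i,s(i)} = \frac{\alpha}{1 + \alpha} + \frac{1}{1 + \alpha} \, e_{i,s(i)}. \tag{10}$$

We can observe that it includes a data-invariant component $\frac{\alpha}{1+\alpha}$ that depends only on the value of the scaling factor. We now recall the formula for the optimal membership $\hat{u}^{P97}_{jk}$, presented in [3, p. 789] without an equation number, as

$$\hat{u}^{P97}_{jk} = \frac{1}{1 + \alpha} \left[ \left( 1 + \alpha \Big( 1 - b_j \sum_{g=1}^{c} f_{jg} \Big) \right) e_{jk} + \alpha f_{jk} b_j \right]. \tag{11}$$

While (11) differs from (6), the distinct types of memberships $\hat{u}^{P97}_{hk}$, $\hat{u}^{P97}_{i,k \neq s(i)}$, $\hat{u}^{P97}_{i,s(i)}$ do not differ from their counterparts derived from $J_{SSFCM}$ and presented in (8), (9), (10). We leave the simple calculus confirming this statement to the reader and state that the difference between $J_{SSFCM}$ and $J_{P97}$ does not result in different estimated memberships. However, this difference between the objective functions affects the estimated prototypes $\hat{V}$. Let us note that [3] did not permit partial supervision to influence clusters' prototypes, associating $\hat{V}^{P97}$ with formulae from unsupervised FCM. Therefore, to present the effect of treating $b_j$ differently in $J_{SSFCM}$ and $J_{P97}$, we derive $\hat{V}$ from scratch. Let us define

$$J_{SSFCM}(v_k) = \sum_{j=1}^{N} \varphi_{jk} d_{jk}^2 \tag{12}$$

where $\varphi_{jk} = u_{jk}^2 + \alpha b_j (u_{jk} - f_{jk})^2$ is called an individual contribution. We now find the stationary point of (12):

$$\hat{v}_k = \frac{\sum_{j=1}^{N} \varphi_{jk} x_j}{\sum_{j=1}^{N} \varphi_{jk}}. \tag{13}$$

Optimizing $J_{P97}(v_k)$, one would arrive at an equation similar to (13), but instead of the individual contributions $\varphi_{jk}$, there would be

$$\omega_{jk} = u_{jk}^2 + \alpha (u_{jk} - f_{jk} b_j)^2. \tag{14}$$

Let us compare the form of the individual contributions $\varphi_{jk}$ and $\omega_{jk}$ in the three distinct types of the soft assignment:

$$\varphi_{hk} = u_{hk}^2, \qquad \varphi_{i,k \neq s(i)} = (1 + \alpha) u_{i,k}^2, \qquad \varphi_{i,s(i)} = u_{i,s(i)}^2 + \alpha (u_{i,s(i)} - 1)^2 \tag{15a}$$

and

$$\omega_{hk} = (1 + \alpha) u_{hk}^2, \qquad \omega_{i,k \neq s(i)} = (1 + \alpha) u_{i,k}^2, \qquad \omega_{i,s(i)} = u_{i,s(i)}^2 + \alpha (u_{i,s(i)} - 1)^2. \tag{15b}$$

In the case of $\hat{v}^{P97}_k$, the individual contribution of the unsupervised observation $\omega_{hk}$ has the same form as the contribution of the supervised one $\omega_{i,k \neq s(i)}$. This is undesired and does not occur in the case of $\varphi_{hk}$ and $\varphi_{i,k \neq s(i)}$. Note that in SSFCMeans, $u_{hk}$ is not impacted by the scaling factor α in any way, and this is why we postulate the same for $\hat{v}_k$ when considering the contribution of the unsupervised observation $h$.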
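The membership and prototype updates discussed in this section can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions: the function names, the small numerical guard on distances, the stopping rule, and the toy data are all choices made for this example, and the code is not the authors' R implementation.

```python
import numpy as np

def data_evidence(X, V):
    """Data evidence e_jk = 1 / sum_g (d_jk^2 / d_jg^2), Euclidean d."""
    d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)  # N x c squared distances
    inv = 1.0 / np.maximum(d2, 1e-12)  # guard (assumption): point on a prototype
    return inv / inv.sum(axis=1, keepdims=True)

def update_memberships(X, V, F, alpha):
    """Optimal memberships for fixed prototypes, for binary prior memberships F."""
    b = F.sum(axis=1, keepdims=True)              # b_j = sum_k f_jk
    e = data_evidence(X, V)
    free = 1.0 + alpha * b * (1.0 - b)            # 1 + alpha*b_j*(1 - sum_g f_jg)
    return (free * e + alpha * b * F) / (1.0 + alpha * b)

def update_prototypes(X, U, F, alpha):
    """Optimal prototypes for fixed memberships, using contributions phi_jk."""
    b = F.sum(axis=1, keepdims=True)
    phi = U ** 2 + alpha * b * (U - F) ** 2
    return (phi.T @ X) / phi.sum(axis=0)[:, None]

def fit_ssfcmeans(X, F, V0, alpha, n_iter=100, tol=1e-8):
    """Alternate the two updates until the prototypes stop moving."""
    V = V0.copy()
    for _ in range(n_iter):
        U = update_memberships(X, V, F, alpha)
        V_new = update_prototypes(X, U, F, alpha)
        converged = np.abs(V_new - V).max() < tol
        V = V_new
        if converged:
            break
    return update_memberships(X, V, F, alpha), V

# Toy data: observations 0 and 2 are supervised, 1 and 3 are not.
X = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 4.0], [5.0, 5.0]])
F = np.array([[1.0, 0.0], [0.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
V0 = np.array([[0.5, 0.5], [4.5, 4.5]])
alpha = 1.0
U, V = fit_ssfcmeans(X, F, V0, alpha)
```

In this sketch, every row of `U` sums to one, the rows of the unsupervised observations equal their data evidence, and the supervised memberships `U[0, 0]` and `U[2, 1]` exceed $\alpha / (1 + \alpha)$, in line with the three membership types derived above.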

C. Semi-Supervised Possibilistic C-Means
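We now apply the additive combination technique from (2) to introduce partial supervision to PCM. The idea of PCM comes from a relaxation of the probabilistic constraint in FCM presented in (4b). To avoid a trivial solution where each membership is estimated to be 0, a special form of the objective function was proposed in [24]:

$$Q_{PCM}(T, V; X, \Gamma) = \sum_{j=1}^{N} \sum_{k=1}^{c} t_{jk}^2 d_{jk}^2 + \sum_{k=1}^{c} \gamma_k \sum_{j=1}^{N} (1 - t_{jk})^2 \tag{16}$$

where $T = [t_{jk}]$ is a typicalities matrix, and the vector $\Gamma = (\gamma_1, \ldots, \gamma_c)^T$ contains cluster-specific scalars $\gamma_k > 0$. Note that [24, p. 101] allowed $m \in (1, \infty)$, but recall that in this article we set $m = 2$.
The supervised component $Q_S^{SSPCM}$ that we propose is the same as in [27]:

$$Q_S^{SSPCM} = \sum_{j=1}^{N} \sum_{k=1}^{c} b_j (t_{jk} - f_{jk})^2 d_{jk}^2. \tag{17}$$

Since we regard classical approaches in this article, we propose to combine (17) with (16) to obtain the objective function

$$J_{SSPCM}(T, V; X, F, \Theta) = Q_{PCM} + \alpha \, Q_S^{SSPCM} \tag{18}$$

and the associated minimization problem (19a), where the constraints (19b) and (19c) are the same as in the unsupervised PCM. Compared with our approach, Antoine et al. [27] combined (17) with the objective function of repulsive PCM (20). However, since the objective functions (16) and (20) include $t_{jk}$ in the same way, the formula for the optimal typicality in SSPCMeans is the same as derived in [27] and reads

$$\hat{t}_{jk} = \frac{\gamma_k + \alpha b_j f_{jk} d_{jk}^2}{\gamma_k + (1 + \alpha b_j) d_{jk}^2}. \tag{21}$$

Considering the distinct types of $\hat{t}_{jk}$, we obtain

$$\hat{t}_{hk} = \frac{\gamma_k}{\gamma_k + d_{hk}^2} \tag{22a}$$
$$\hat{t}_{i,k \neq s(i)} = \frac{\gamma_k}{\gamma_k + (1 + \alpha) d_{ik}^2} \tag{22b}$$
$$\hat{t}_{i,s(i)} = \frac{\gamma_{s(i)} + \alpha d_{i,s(i)}^2}{\gamma_{s(i)} + (1 + \alpha) d_{i,s(i)}^2}. \tag{22c}$$

The optimal cluster's prototype in our SSPCMeans differs from [27], because (16) and (20) differ in treating $V$, and has the form

$$\hat{v}_k = \frac{\sum_{j=1}^{N} \left[ t_{jk}^2 + \alpha b_j (t_{jk} - f_{jk})^2 \right] x_j}{\sum_{j=1}^{N} \left[ t_{jk}^2 + \alpha b_j (t_{jk} - f_{jk})^2 \right]} \tag{23}$$

where the derivation is analogous to the one presented for SSFCMeans in the previous section.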
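A short numerical sketch can make the typicality update concrete. It assumes the closed form obtained by setting the derivative of the combined PCM objective (with $m = 2$) with respect to $t_{jk}$ to zero, namely $t_{jk} = (\gamma_k + \alpha b_j f_{jk} d_{jk}^2) / (\gamma_k + (1 + \alpha b_j) d_{jk}^2)$; the function name and the toy values are hypothetical.

```python
import numpy as np

def update_typicalities(d2, F, gamma, alpha):
    """Typicalities for fixed squared distances d2 (N x c), assuming m = 2.

    gamma is a length-c vector of cluster-specific scalars; F is the binary
    prior memberships matrix, so b_j = sum_k f_jk marks supervised rows.
    """
    b = F.sum(axis=1, keepdims=True)
    return (gamma + alpha * b * F * d2) / (gamma + (1.0 + alpha * b) * d2)

alpha = 1.0
gamma = np.array([2.0, 3.0])
d2 = np.tile(gamma, (2, 1))              # every squared distance equals gamma_k
F = np.array([[0.0, 0.0],                # row 0: unsupervised observation
              [1.0, 0.0]])               # row 1: supervised, s(i) is cluster 0
T = update_typicalities(d2, F, gamma, alpha)
```

With all squared distances set to $\gamma_k$, the unsupervised row of `T` equals 0.5 in both columns, while the supervised row equals $(1 + \alpha)/(2 + \alpha)$ for the supervised cluster and $1/(2 + \alpha)$ for the other one. Unlike memberships, these typicalities are not forced to sum to one across clusters.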
A critical issue with Interpretation 1 is that the scaling factor α is considered only in the context of the objective function. We thus extend this definition and discuss the connection between the scaling factor and the outcome of the model, i.e., the estimated memberships matrix $\hat{U}$.
Furthermore, the descriptions of the scaling factor α in the literature are imprecise and inconsistent. In the following, we list selected citations that use naming conventions different from Interpretation 1:
- "α be proportional to the rate N/M" [3, p. 788];
- "α parameter is set in such a way that two terms of the objective function have the same importance" [33, p. 57];
- "β is the impact intensity of the semi-supervised component" [13, p. 671];
- "where λ is the ratio of labeled sample points in the data sample" [11, p. 135];
- "where $\lambda_1$ and $\lambda_2$ are the regularization parameters which control the tradeoff between FCM and SSFCM" [22, p. 387].
The terms "balance," "intensity," or "tradeoff" may imply a proportional impact of the scaling factor α on the outcomes of the model, but do not have to. There are no clear statements about the functional character of the impact in the corresponding articles. Only Pedrycz and Waletzky [3] used the word "proportional" directly, but they used it to establish α as a function of the data (the number of labeled observations), not to discuss how much α impacts the outcomes of modeling (regardless of the data).
The aforementioned problems lead to inconsistent processes of selecting the value of the hyperparameter α that are not justified analytically. The importance of the scaling factor α is clearly seen in (2). Regardless of the functional form of $Q$ or $Q_S$, the role of α is the same. It clearly impacts the estimated variables $\hat{U}$ and $\hat{V}$.

A. Differences Between Interpretation and Explanation
To distinguish between an interpretation and an explanation, we propose the following three criteria that an explanation of the scaling factor α must satisfy.
(C1) Interpretability.
(C2) Completeness.
(C3) Quantification.
Any description that satisfies criterion (C1) and one more criterion (C2 or C3), but not all three criteria, is considered an interpretation. Gilpin et al. [30] provided two criteria for evaluating explanations: interpretability and completeness, which are referred to as (C1) and (C2) in this article. Criterion (C3) is our additional requirement specific to the scaling factor α: we want to express the impact of partial supervision as a function IPS(α).
Let us now elaborate on how to check criteria (C1)-(C3) for a given description of the scaling factor α. For (C1) interpretability, Broniatowski [29] states that "an interpretable model should provide users with a description of what a stimulus (a data point or model's output) means in context." Regarding SSFC, the scaling factor α is the stimulus that we require to be put in a context. Moreover, it does not suffice to provide just any context, as an interpretable description should be "understandable to humans" [30].
(C2) completeness is satisfied when a description of the system's operation is accurate [30]. We associate this criterion with a proposition from [29]: "an explanation of a model result is a description of how a model's outcomes came to be." Note that in the case of SSFC, the key outcome of the model is the estimated memberships matrix $\hat{U}$. Taking all the aforementioned into account, we require a complete description of the scaling factor α to describe accurately the relationship between α and $\hat{U}$.
Finally, criterion (C3) quantification stems from the need to numerically assess the difference between the impact of different values $\alpha_1$ and $\alpha_2$, $\alpha_1 \neq \alpha_2$, on the results of an SSFC model. An explainable impact of partial supervision must be associated with a function IPS(α) that allows the calculation of the difference $\mathrm{IPS}(\alpha_1) - \mathrm{IPS}(\alpha_2)$.

B. Explanation of the Scaling Factor α in SSFCMeans
It is clearly shown in (10) that for $\hat{u}_{i,s(i)}$, regardless of the data evidence, we are guaranteed that

$$\hat{u}_{i,s(i)} > \frac{\alpha}{1 + \alpha}.$$

We propose to call the quantity $\frac{\alpha}{1+\alpha}$ the Absolute Lower Bound to stress its nature. To the best of our knowledge, the absolute lower bound has not been discussed in the literature so far, even though it is a straightforward conclusion that stems from well-known equations and may significantly impact the outcomes of the model. Let us now formulate an explanation of the impact of partial supervision in SSFCMeans.
Explanation 1 (IPS in SSFCMeans): The scaling factor α quantifies the impact of partial supervision as $\mathrm{IPS}(\alpha) = \frac{\alpha}{1+\alpha}$, and establishes an absolute lower bound for the membership of a supervised observation to the supervised cluster: $u_{i,s(i)} > \mathrm{IPS}(\alpha)$.
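To make the nonlinearity in Explanation 1 tangible, the following small sketch evaluates $\mathrm{IPS}(\alpha) = \alpha / (1 + \alpha)$ and the resulting supervised membership for a few values of α. The function names are hypothetical, and the membership formula is the decomposition into the data-invariant component plus the down-weighted data evidence described in Section II-B.

```python
def ips_ssfcm(alpha):
    """IPS(alpha) = alpha / (1 + alpha) for SSFCMeans (Explanation 1)."""
    if alpha < 0:
        raise ValueError("alpha must be non-negative")
    return alpha / (1.0 + alpha)

def supervised_membership(alpha, evidence):
    """Membership of a supervised observation to its supervised cluster:
    the absolute lower bound plus the down-weighted data evidence."""
    return ips_ssfcm(alpha) + (1.0 - ips_ssfcm(alpha)) * evidence

for alpha in (0.5, 1.0, 2.0, 4.0):
    print(alpha, round(ips_ssfcm(alpha), 3), round(supervised_membership(alpha, 0.1), 3))
```

Doubling α from 1 to 2 raises IPS from 0.5 to about 0.667 rather than to 1, so the impact is not directly proportional to α. The difference $\mathrm{IPS}(\alpha_1) - \mathrm{IPS}(\alpha_2)$ required by criterion (C3) is a plain subtraction of two such values.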

C. Explanation of the Scaling Factor α in SSPCMeans
Let us first consider an interpretation of the hyperparameter γ k provided in [24].
Interpretation 2: The value of γ k determines the distance at which the typicality value of a point in a cluster becomes 0.5.
It comes from the fact that if we consider a distance $d_{hk}^2 := \gamma_k$, then for the typicality in unsupervised PCM (22a)

$$\hat{t}_{hk} = \frac{\gamma_k}{\gamma_k + \gamma_k} = 0.5.$$

With the aim of providing an explanation of the scaling factor α in SSPCMeans, we make a similar assumption and study the difference between: (I) the typicality of an unsupervised observation to any cluster, $t_{hk}$ from (22a), and (II) the typicality of a supervised observation to the supervised cluster, $t_{i,s(i)}$ from (22c).
Let us consider an arbitrary observation $a$ and an arbitrary cluster $b$. First, assume that $t^{(I)}_{ab}$ is the unsupervised typicality to any cluster, as in (22a). We know that if we set $\gamma_b := d_{ab}^2$, then the typicality $t^{(I)}_{ab}$ is 0.5.
Suppose that we obtain the label of observation $a$, so it becomes supervised, and $b = s(a)$ happens to be the supervised cluster. Therefore, the typicality value takes the form from (22c), and assuming $\gamma_b := d_{ab}^2$, we obtain

$$t^{(II)}_{ab} = \frac{\gamma_b + \alpha d_{ab}^2}{\gamma_b + (1 + \alpha) d_{ab}^2} = \frac{1 + \alpha}{2 + \alpha}.$$

Note that the only change concerns the value of the typicality $t^{(II)}_{ab}$: this is still the same observation $a$, the same cluster $b$, the same hyperparameter $\gamma_b$, and the same fixed distance $d_{ab}^2$. Therefore, we can quantify the impact of partial supervision as

$$\mathrm{IPS}(\alpha) = t^{(II)}_{ab} - t^{(I)}_{ab} = \frac{1 + \alpha}{2 + \alpha} - \frac{1}{2} = \frac{\alpha}{2(2 + \alpha)}.$$

We can now propose an explanation of the scaling factor α.
Explanation 2 (IPS in SSPCMeans): In the supervised case, the scaling factor α increases the typicality of a supervised observation to the supervised cluster $t_{i,s(i)}$ by $\mathrm{IPS}(\alpha) = \frac{\alpha}{2(2+\alpha)}$ for the same distance, determined by $\gamma_{s(i)}$, at which the typicality in the unsupervised case was equal to 0.5.

Regarding the criterion (C2) completeness, let us recall that Interpretation 1 relates α to the objective function. The implicit statement "(...) a balance between the supervised and unsupervised component (...)" [3, p. 789] means in fact "a balance between the supervised and unsupervised component of the objective function." It is unclear from this interpretation how the outcome $\hat{U}$ of the SSFCMeans model came to be, since Interpretation 1 does not relate α to the variable $\hat{u}_{jk}$. On the contrary, both Explanation 1 (SSFCMeans) and Explanation 2 (SSPCMeans) explain the scaling factor α in terms of its impact on the soft assignment variables by precise reference to the models' mechanisms. Explanation 1 relates IPS to the membership of a supervised observation to the supervised cluster $u_{i,s(i)}$, and Explanation 2 discusses IPS in terms of the difference between the supervised typicality $t_{i,s(i)}$ and the typicality obtained as if the observation were treated as unsupervised.

D. Checking the Criteria
Regarding the criterion (C3) quantification, Pedrycz and Waletzky [3] suggested that the value of α should be proportional to the rate N/M, relating it to the data. We enhance this proposition and express it in terms of the impact of partial supervision as a function $\mathrm{IPS}_{P97}(\alpha) = \alpha$. Nonetheless, we show that the impact of partial supervision is not directly proportional to α.

IV. PRACTICAL CONSIDERATIONS
In the preceding sections, our analyses have contributed to the establishment of theoretically sound explanations regarding the impact of partial supervision in SSFC. In the context of the SSFCMeans and SSPCMeans models, this impact is regulated by the scaling factor α. Despite the provided explanations, practical questions arise: which values of α should be used? How can one empirically assess the impact of partial supervision when fitting the model to the data? In this section, we build on the results from the preceding analyses and delve into these specific practical considerations.
The source code for the reproducible simulations described in Section IV-B is publicly available on CodeOcean [34]. In the absence of open-source implementations of SSFCMeans, we implemented it in the R language from scratch and made it publicly available on GitHub.¹

A. Constructing Cross-Validation Grids
A standard practice for selecting the value of a hyperparameter of any model is to cross-validate (CV) it, i.e., to create a $K$-tuple of $K$ different values to be checked (called a grid), fit a model for each value, and finally find the best model with respect to some criterion; the selected value of the hyperparameter is the one associated with the best model. In the SSFC domain, a common CV approach is to select a few α values that divide the search space roughly equally.
For instance, Bouchachia and Pedrycz [8] tried $\mathrm{grid}_B = \langle 0.3, 0.5, 0.7, 0.9, 1 \rangle$, whereas Antoine et al. [28] tried $\mathrm{grid}_A = \langle 0.01, 0.05, 0.1, 0.5, 1 \rangle$. These CV grids cover the space of α values, since they implicitly follow Interpretation 1 and the associated proportionality assumption that was expressed as $\mathrm{IPS}_{P97}(\alpha) = \alpha$. Such a function has a significant analytical disadvantage: it is bounded only from below. Theoretically, using Interpretation 1, one could think about increasing the value of α indefinitely, expecting that each increase in α will result in a directly proportional increase of the impact of partial supervision. In practice, none of the works reviewed in Section III analyzed this issue, and the maximum value of α considered in CV rarely exceeds 1 (as can be seen in $\mathrm{grid}_A$ and $\mathrm{grid}_B$).
On the contrary, the IPS(α) functions for both Explanations 1 and 2 do not suffer from such problems. They are nonlinear, monotonically increasing functions of α bounded from above and below. Their properties enable an analytically justified procedure tailored to creating CV grids for α in SSFC. One can analyze the derivative IPS'(α) and decide on a point beyond which the further increase in IPS becomes negligible. We call this point a β boundary. Fig. 1 presents the IPS functions together with their derivatives IPS' = ∂IPS/∂α for the SSFCMeans and SSPCMeans models. Fig. 2 contains the proposed Algorithm for selecting the α grid based on the β boundary.

¹ [Online]. Available: https://github.com/ITPsychiatry/ssfclust

Fig. 2. Algorithm for establishing cross-validation grid for α.
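To see what these literature grids look like once the proportionality assumption is dropped, the following sketch maps both of them through the SSFCMeans function $\mathrm{IPS}(\alpha) = \alpha / (1 + \alpha)$ from Explanation 1. The helper names are hypothetical and the rounding is only for display.

```python
grid_A = (0.01, 0.05, 0.1, 0.5, 1.0)   # grid tried by Antoine et al. [28]
grid_B = (0.3, 0.5, 0.7, 0.9, 1.0)     # grid tried by Bouchachia and Pedrycz [8]

def ips_ssfcm(alpha):
    """IPS(alpha) = alpha / (1 + alpha) for SSFCMeans."""
    return alpha / (1.0 + alpha)

def to_ips(grid):
    """Express a grid of alpha values in terms of IPS, rounded for display."""
    return [round(ips_ssfcm(a), 3) for a in grid]

ips_A = to_ips(grid_A)
ips_B = to_ips(grid_B)
print(ips_A)
print(ips_B)
```

Expressed this way, $\mathrm{grid}_B$ spans only the range from about 0.231 to 0.5 in IPS, and neither grid is evenly spaced in IPS, which is the kind of uneven coverage an IPS-based grid is meant to avoid.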
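Since the Algorithm from Fig. 2 is not reproduced in the text, the following is only one plausible reading of it, given as a hedged sketch: pick an upper limit for IPS (for instance from a β boundary found by thresholding the derivative), split the IPS interval into equal steps, and map each step back to α through the inverse of $\mathrm{IPS}(\alpha) = \alpha / (1 + \alpha)$. All function names, the slope threshold, and the value `ips_max=0.55` are assumptions of this example.

```python
def ips_ssfcm(alpha):
    """IPS(alpha) = alpha / (1 + alpha) for SSFCMeans."""
    return alpha / (1.0 + alpha)

def alpha_from_ips(ips):
    """Inverse of IPS for SSFCMeans: alpha = ips / (1 - ips), for 0 <= ips < 1."""
    if not 0.0 <= ips < 1.0:
        raise ValueError("ips must lie in [0, 1)")
    return ips / (1.0 - ips)

def beta_boundary(min_slope):
    """Hypothetical rule: the alpha at which IPS'(alpha) = 1/(1+alpha)^2
    has dropped to min_slope."""
    return min_slope ** -0.5 - 1.0

def alpha_grid(ips_max, k):
    """K alpha values whose IPS values split (0, ips_max] into equal steps."""
    return [alpha_from_ips(ips_max * i / k) for i in range(1, k + 1)]

# ips_max = 0.55 is an assumed value; it could also be set to
# ips_ssfcm(beta_boundary(min_slope)) for a chosen slope threshold.
grid = [round(a, 2) for a in alpha_grid(ips_max=0.55, k=5)]
print(grid)
```

Under these assumptions the sketch yields the values 0.12, 0.28, 0.49, 0.79, 1.22, which coincide with the entries of $\mathrm{grid}_{IPS}$ used in Section IV-B; this suggests, but does not prove, that the authors' algorithm follows a similar equal-step construction in IPS space.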

B. Empirical Impact of Partial Supervision
When working with data, a need frequently arises to ascertain how the introduction of partial supervision alters the outcomes of modeling a given dataset when contrasted with a lack of supervision, or when the impact of partial supervision is reduced by a certain factor (e.g., two-fold). We call it the analysis of the empirical impact of partial supervision, as it depends not only on the theoretical explanations but also on the specific data patterns. In the context of the SSFCMeans model, we postulate that the examination of the distribution of supervised memberships $\{u_{i,s(i)}\}_{i=1,\ldots,M}$ is not the optimal choice, albeit an intuitive one. This is due to the combined theoretical and empirical nature of $u_{i,s(i)}$ from (10). Denoting this membership in the functional convention, we obtain

$$u_{i,s(i)} = u(\alpha, e_{i,s(i)}) = \frac{\alpha}{1 + \alpha} + \frac{1}{1 + \alpha} \, e_{i,s(i)}.$$

It is the data evidence $e_{i,s(i)}$ that contains the truly empirical impact of the partial supervision, as it is a direct function of the data. Therefore, the analysis of the distribution $\{e_{i,s(i)}\}$ enables a direct investigation of the extent to which the impact of partial supervision affected the prototypes and, consequently, the relative distances between the observations and these prototypes in a given model.

Let us now illustrate the aforementioned approach in a concrete data analysis scenario. We consider a three-class semi-supervised problem, i.e., $Y = \langle y_1, y_2, y_3 \rangle$. The data are simulated in a nested loop. The outer loop consists of sampling 100 observations for each class from a 2-D Gaussian distribution $\mathcal{N}_2(\mu_k, \Sigma_k)$, where $\mu_1 = (5, 5)^T$, $\mu_2 = (7, 7)^T$, $\mu_3 = (9, 9)^T$, and $\Sigma_k = \mathrm{diag}(5, 5)$. Each $k$th distribution is associated with the $k$th class. Such a procedure yields spherical, overlapping clusters, which are hardly separable. The outcome of this outer loop is a features matrix $X_{[300 \times 2]}$. Fig. 3 presents an example of such a matrix with colors and shapes denoting the classes of observations. The inner loop relies on randomly selecting 15% of the observations from each class that will remain supervised (leading to 45 observations treated as supervised in each simulated dataset). We performed ten outer loops with ten inner loops for each simulated $X$, arriving at 100 simulation runs.
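The nested simulation design can be sketched as follows. This is an illustrative reconstruction rather than the authors' published code: the function names and the random seed are hypothetical, and the covariance matrix is assumed here to be $\mathrm{diag}(5, 5)$ for every class, which is consistent with the description of spherical, overlapping clusters.

```python
import numpy as np

MEANS = np.array([[5.0, 5.0], [7.0, 7.0], [9.0, 9.0]])
COV = np.diag([5.0, 5.0])  # assumption: same spherical covariance for each class

def simulate_features(rng, n_per_class=100):
    """Outer loop: sample n_per_class observations per class from N_2(mu_k, COV)."""
    X = np.vstack([rng.multivariate_normal(mu, COV, size=n_per_class) for mu in MEANS])
    y = np.repeat(np.arange(len(MEANS)), n_per_class)  # true class of each row
    return X, y

def draw_supervision(rng, y, n_classes=3, share=0.15):
    """Inner loop: keep a random `share` of each class supervised, return F."""
    F = np.zeros((y.size, n_classes))
    for k in range(n_classes):
        members = np.flatnonzero(y == k)
        n_keep = int(round(share * members.size))
        chosen = rng.choice(members, size=n_keep, replace=False)
        F[chosen, k] = 1.0
    return F

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility of the sketch
runs = []
for _ in range(10):             # ten outer loops
    X, y = simulate_features(rng)
    for _ in range(10):         # ten inner loops per simulated X
        runs.append((X, draw_supervision(rng, y)))
```

Each element of `runs` pairs a 300-by-2 features matrix with a prior memberships matrix that marks 45 supervised observations, giving the 100 simulation runs described above.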
We now build CV grids. Specifically, we compare a proposition from the literature, grid_B(α) = {0.3, 0.5, 0.7, 0.9, 1} [8], with grid_IPS(α) = {0.12, 0.28, 0.49, 0.79, 1.22}, which we constructed based on the Algorithm from Fig. 2 proposed in this work. The results for these grids are presented against a dense reference grid, grid_ref, composed of 50 α values dividing the interval [0, 1.5] equally (the equivalent interval expressed in terms of IPS(α) is [0, 0.6]). Owing to grid_ref, we can observe a global pattern that one typically does not examine due to time and computational resource constraints.
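To make the relation between the two grids concrete, the sketch below maps between α and IPS(α) = α/(1+α) for SSFCMeans, the form shown as the dotted line in Fig. 4. It is not the Algorithm from Fig. 2 itself: it simply assumes that the grid is obtained by spacing IPS values equally and inverting the IPS function, and the step of 0.11 is an assumption inferred from the reported grid values. The function names are hypothetical.

```python
def ips(alpha):
    """Impact of partial supervision for SSFCMeans: alpha / (1 + alpha)."""
    return alpha / (1.0 + alpha)


def alpha_from_ips(value):
    """Invert IPS: the alpha that yields a requested impact in [0, 1)."""
    if not 0.0 <= value < 1.0:
        raise ValueError("IPS must lie in [0, 1)")
    return value / (1.0 - value)


def ips_grid(step, n_points):
    """Alphas whose IPS values are step, 2*step, ..., n_points*step."""
    return [alpha_from_ips(step * k) for k in range(1, n_points + 1)]


# Assumed step of 0.11 in IPS space (inferred, not stated in the text).
grid_ips = [round(a, 2) for a in ips_grid(step=0.11, n_points=5)]

# The literature grid expressed in IPS space for comparison.
grid_b_in_ips = [round(ips(a), 2) for a in (0.3, 0.5, 0.7, 0.9, 1.0)]
```

Under these assumptions, `grid_ips` reproduces the α values 0.12, 0.28, 0.49, 0.79, 1.22, while `grid_b_in_ips` evaluates to 0.23, 0.33, 0.41, 0.47, 0.5, which shows that grid_B covers only the upper part of the IPS interval [0, 0.6] and does so with unequal steps.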
Fig. 4 presents the summary of the results of fitting the SSFCMeans model to the data from each simulation run r = 1, ..., 100 for each α from the respective grid. We present the total median ē(α) = Me{e^{α,r=1}_{1,s(1)}, ...}, i.e., the median of the data evidence obtained for a given α, pooled over the simulation runs.
The growth of ē(α) described previously is associated with the increasingly accurate estimation of the true clusters' prototypes. Table II presents the mean estimated prototype coordinates V1 and V2 together with their standard deviations for models with three values of IPS(α): 0, denoting no supervision at all, and 0.12 and 0.28, the first two entries from grid_IPS, which make it possible to grasp the trends in the simulation results described previously. The total median ē(α) reaches a plateau at approximately IPS(α) = 0.28, since SSFCMeans has already identified the true clusters' prototypes. Due to the noise in the data, the model cannot yield a higher median data evidence e_{i,s(i)} despite the increasing impact of partial supervision. This is an example of the empirical impact of partial supervision deviating from the theoretical one. Finally, let us note the differences between grid_IPS and grid_B. The former splits the IPS(α) space into equal intervals and allows one to identify a changing trend in the behavior of ē(α), whereas the latter covers a narrower interval of IPS(α). This specific simulation scenario confirms the need for the analytically justified creation of CV grids presented in the Algorithm from Fig. 2.

C. Estimating Label Uncertainty
In the previous section, we knew the process generating the data; hence, the obtained labels y_i were certain. In practice, this process is typically unknown, so the certainty of the labels may be questioned. Pedrycz and Waletzky [3] proposed to handle this situation by incorporating a confidence factor conf_j ∈ [0, 1] into the objective function of SSFCMeans. However, their approach requires assessing the uncertainty upfront. Frequently, such knowledge is not available, especially when the data annotation process is complex [23].
To overcome this problem, Kmita et al. [23] proposed the CPR procedure to estimate the adjusted confidence factor conf_i from the data. CPR wraps SSFCMeans, implementing the regularization assumption: highly certain supervised observations should be consistently assigned high u_{i,s(i)} across varying values of the scaling factor. A path of r = 1, ..., R models is fitted, each decreasing the default α uniformly for all observations by conf_i = reg_r ∀i. The adjusted conf_i for the ith observation is then obtained as a weighted summary of the memberships from the R models, where the weights w_r compensate for the decreased α_r = α · reg_r. Kmita et al. [23] proposed to use the proportionality rule, i.e., to set w_r = 1/r. The first four columns of Table III contain exemplary data required to calculate conf_i in a CPR procedure composed of R = 3 steps.
Note that the aforementioned procedure is implicitly based on Interpretation 1, which quantifies the impact of partial supervision as IPS_P97(α) = α, and hence may lead to inaccurate conclusions. For the example from Table III, the adjusted conf_i = 0.43, and we would conclude that this ith observation is not the most certain labeled observation, but definitely not the least certain one. However, if we focus on the information contained in the last two columns of Table III, we clearly see that this conclusion is inaccurate, as the data evidence is extremely low; this labeled observation should thus be considered highly uncertain. This exemplary problem shows the potential issues resulting from an incorrect quantification of the impact of partial supervision and motivates the introduction of the explainability framework into procedures such as CPR.
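The gap between a supervised membership and its data evidence can be illustrated with a short sketch. It rests on an assumption that is not restated in this section: that the supervised membership decomposes as u_{i,s(i)} = α/(1+α) + e_{i,s(i)}/(1+α), which is consistent with the absolute lower bound α/(1+α) referred to in the conclusion. The membership value 0.70 is hypothetical and is not taken from Table III; only α = 2 matches the table caption.

```python
def absolute_lower_bound(alpha):
    """ALB of a supervised membership: alpha / (1 + alpha)."""
    return alpha / (1.0 + alpha)


def data_evidence(u_supervised, alpha, tol=1e-12):
    """Recover e from u, assuming u = alpha/(1+alpha) + e/(1+alpha)."""
    alb = absolute_lower_bound(alpha)
    if not alb - tol <= u_supervised <= 1.0 + tol:
        raise ValueError("membership must lie between the ALB and 1")
    return (1.0 + alpha) * u_supervised - alpha


alpha = 2.0            # scaling factor used in the Table III caption
u_hypothetical = 0.70  # hypothetical supervised membership, not from Table III

alb = absolute_lower_bound(alpha)               # about 0.667
evidence = data_evidence(u_hypothetical, alpha)  # about 0.10
```

Under this assumption, a membership of 0.70 looks moderately high in isolation, yet it lies only slightly above the lower bound of roughly 0.667 that supervision alone guarantees, so the data evidence is close to 0.10. This is the kind of discrepancy that a summary built directly on memberships can hide.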

V. CONCLUSION
The scaling factor α weighs the impact of partial supervision in SSFC and thus has a substantial effect on the estimated memberships and clusters' prototypes. All the models building on the additive combination technique introduced in [3], ranging from semi-supervised adaptations of PFCM [28] to complex workflows that wrap the SSFCMeans model [23], share the same mechanism of regulating the impact of partial supervision by means of the scaling factor α.
We reviewed the existing interpretations of α and its relationship with the impact of partial supervision and concluded that these interpretations are imprecise. They lack completeness, since they interpret α only in terms of the objective function, not the membership degrees. They also suggest a directly proportional relationship between the impact of partial supervision on the memberships and the scaling factor, which we prove to be nonlinear.
Therefore, in this article, we introduced model-specific explanations of the scaling factor α for both SSFCMeans and SSPCMeans that overcome the aforementioned limitations. They fulfill the three necessary criteria of an explanation (interpretability, completeness, and quantification) that we proposed based on the discussions on the explainability framework [29], [30], [31]. Each explanation defines an associated function IPS(α) that quantifies the impact of partial supervision on the memberships.
The benefits of using our novel explanations are substantial. Not only do the explanations clarify the role of α, but they also prove its impact to be a nonlinear bounded function of α. This enables analytically justified procedures for selecting the value of α, such as building cross-validation grids based on the IPS functions, as proposed in the Algorithm from Fig. 2. We also discussed the differences between the theoretical and empirical impact of partial supervision, providing a simulation example to illustrate them. Explanation 1 is of particular importance for procedures that estimate label uncertainty, such as CPR [23]. The concepts of absolute lower bound and data evidence encourage treating label uncertainty with respect to the ALB rather than to the nominal supervised membership.
Finally, further assessment of modeling the impact of partial supervision in the spirit of the additive combination technique remains open for future work. First, Explanation 2 for SSPCMeans requires a simulation or real-life data experiment, which we performed for SSFCMeans only. Second, a promising direction is to assess whether one could introduce custom flexibility into the shape of the absolute lower bound curve α/(1+α) from Explanation 1.

Fig. 1. Impact of partial supervision IPS(α) for α ∈ [0, 5] for both SSFCMeans and SSPCMeans. The IPS(α) for SSFCMeans is shown as a solid blue line, and the corresponding derivative is shown as a dotted blue line. The IPS(α) for SSPCMeans is shown as a dashed red line, and the corresponding derivative is shown as a red dash-dotted line.

Fig. 3. Example of a single simulated features matrix X_[300,2]. The orange triangles represent data points belonging to class y_1, the red diamonds represent data points belonging to class y_2, and the blue circles represent data points belonging to class y_3.
TABLE I COMPARISON OF THE INTERPRETATION 1 OF THE SCALING FACTOR α WITH TWO NOVEL EXPLANATIONS PROPOSED

Table I contains a comparison of Interpretation 1 with two new explanations of the scaling factor α proposed in this article with respect to criteria (C1)-(C3). First and foremost, all the descriptions considered in Table I meet the criterion (C1) interpretability. They put the role of the scaling factor α in a broader context of the model in a "human understandable" language.
For IPS(α) ∈ [0, 0.25], ē(α) is growing approximately proportionally to α/(1+α), which confirms the theoretical quantification of the impact of partial supervision. Starting at IPS(α) ≈ 0.25, the

Fig. 4. Simulation results for ē(α) presented against IPS(α). Solid black lines represent Q1 and Q3 for grid_ref, the gray area represents the IQR, and the white line represents the total median. Red crosses represent total medians for grid_IPS, and blue pluses represent total medians for grid_B. The black dotted line corresponds to IPS(α) = α/(1+α).

TABLE III DATA FOR EXEMPLARY CPR PROCEDURE, α = 2