Constrained Clustering: General Pairwise and Cardinality Constraints

In this work, we study constrained clustering, where constraints are utilized to guide the clustering process. In existing works, two categories of constraints have been widely explored, namely pairwise and cardinality constraints. Pairwise constraints enforce the cluster labels of two instances to be the same (must-link constraints) or different (cannot-link constraints). Cardinality constraints encourage cluster sizes to satisfy a user-specified distribution. However, most existing constrained clustering models can only utilize one category of constraints at a time. In this paper, we incorporate both categories into a unified clustering model, starting from the integer program formulation of standard K-means. As these two categories provide useful information at different levels, utilizing both of them is expected to allow for better clustering performance. However, the optimization is difficult due to the binary and quadratic constraints in the proposed unified formulation. To alleviate this difficulty, we utilize two techniques: the first equivalently replaces the binary constraints by the intersection of two continuous constraints; the second transforms the quadratic constraints into bi-linear constraints by introducing extra variables. We then derive an equivalent continuous reformulation with simple constraints, which can be efficiently solved by the Alternating Direction Method of Multipliers (ADMM). Extensive experiments on both synthetic and real data demonstrate that 1) when utilizing a single category of constraints, the proposed model is superior to or competitive with state-of-the-art constrained clustering models, and 2) when utilizing both categories of constraints jointly, the proposed model performs better than when a single category is used.
The experimental results show that the proposed method exploits the constraints to achieve perfect clustering performance, with improvements of $2-5$% in classical clustering metrics, e.g., Adjusted Rand Index (ARI), Mirkin's Index (MI), and Hubert's Index (HI), outperforming all compared methods across the board. Moreover, we show that our method is robust to initialization.


I. INTRODUCTION
Clustering is the task of partitioning data into different clusters, based on some specific cluster assumptions. For example, K-means and Gaussian mixture models (GMM) assume each cluster is sampled from a Gaussian distribution. In contrast, density-based clustering assumes that the densities of data points in different clusters should be different, such as Chameleon [1] and AITC [2], or that clusters should be partitioned at low density regions [3]. However, if the adopted cluster assumption is not suited to the target dataset, this may result in poor performance. To avoid such performance instability, prior knowledge or constraints on the data can be used to guide the clustering process. These constraints are independent of cluster assumptions, and they provide weak supervision to reflect user preferences. Thus, clustering with constraints, called constrained clustering [4]-[6], is expected to give better and more stable performance than unconstrained clustering.
Two main categories of constraints have been widely studied in the field of constrained clustering, namely pairwise and cardinality constraints. Pairwise constraints may arise from some form of perceived similarity between samples. For instance, the continuity property is a form of pairwise constraint that suggests that neighbouring samples are likely to be clustered together and vice versa. Pairwise constraints include must-link and cannot-link constraints. Must-link constraints enforce that a set of pairs of instances should be in the same cluster, while cannot-link constraints enforce that they belong to different clusters. This category can therefore be viewed as instance-level constraints. On the other hand, cardinality constraints provide extra knowledge on the size distribution of all clusters. This sort of constraint becomes particularly necessary in clustering tasks where the data is high dimensional and sparse with many clusters to assign [7], a setting that often leads to solutions with empty clusters or unbalanced cluster assignments. Balancing constraints that lead to equal sized clusters are only a special case of cardinality constraints. This category in general can be viewed as cluster-level constraints.
Many clustering methods have been proposed to utilize one of the two categories of constraints, such as the ones with pairwise constraints [4]-[6], [8]-[13], and the ones with cardinality constraints [7], [14]-[16]. However, in some cases, one might want to enforce the continuity property among a set of points while at the same time requiring solutions with balanced or user-specified cluster sizes. In general, both constraints can be provided simultaneously, as they are derived from different sources. For example, pairwise constraints are usually obtained from an oracle query, while cardinality constraints can be obtained from experience or user preference. Moreover, they represent supervision at different levels. Each of them can provide particularly useful information that is not covered by the other. Thus, having both sets of constraints together in a clustering task should significantly improve the performance. To the best of our knowledge, there is no existing work that can seamlessly incorporate both categories jointly.
For existing constrained clustering methods that handle one constraint category, it is non-trivial to directly add the other. For example, embedding cardinality constraints into COP-KMEANS [13] leads to unstable performance, where COP-KMEANS will often fail to find a feasible solution. This is because COP-KMEANS is very sensitive to cluster initialization. Moreover, it is also not easy to embed pairwise constraints into normalized/ratio cut [16], which exploits balanced distribution constraints. In short, existing models are designed to exploit one category of constraints at a time.
We propose a unified model to incorporate both categories of constraints to guide the clustering process. Specifically, we start from the formulation of the standard K-means method, and formulate cardinality constraints as linear constraints and pairwise constraints as quadratic constraints. We then obtain a discrete optimization problem with quadratic constraints, which is difficult to solve with off-the-shelf optimization methods. Thus, we propose to utilize two techniques. One is to equivalently replace the binary constraints by the intersection of two continuous constraints, which was first proposed in [17]. The other is to transform the quadratic constraints into bi-linear constraints by introducing extra variables. Our key contributions revolve around the novel continuous reformulation of the K-means problem, which allows both cardinality and pairwise constraints to be incorporated. The reformulation is simple and flexible, and it converges stably in practice with competitive performance. The contributions are threefold.
• We embed both pairwise and cardinality constraints into one unified clustering model. To the best of our knowledge, this is the first attempt in the field of constrained clustering.
• We propose to equivalently transform the binary and quadratic constraints in the original problem to continuous and bi-linear constraints, to obtain a simple continuous reformulation.
• We conduct various experiments on several synthetic and real datasets, comparing against 5 different algorithms. Our approach demonstrates a competitive edge over all compared methods in the final clustering performance while respecting the imposed constraints.

II. RELATED WORK
Here, we briefly review existing clustering models that utilize pairwise or cardinality constraints.
Pairwise Constraints. They were first introduced into clustering in [4] and [13]. In [4], a method called COP-COBWEB inserted the pairwise constraints into the clustering process of the incremental clustering method COBWEB [18], which utilizes four operators (i.e. add, new, merge, and split) to maximize the intra-cluster similarity and the inter-cluster dissimilarity. In each operator of COP-COBWEB, the given pairwise constraints are checked to ensure the satisfaction of all constraints. In [13], the method COP-KMEANS checks the pairwise constraints in each assignment step of K-means. Both COP-COBWEB and COP-KMEANS treat the pairwise constraints as hard constraints (i.e. all constraints must be satisfied), and the constraints are somewhat independent of the original objective. A common limitation of these two methods is that the processing order of instances influences clustering performance, and sometimes they may even fail to output a feasible partition.
To avoid this limitation, many methods treat pairwise constraints as soft constraints (i.e. a subset of these constraints could be violated) to develop more flexible approaches to embed constraints. For example, in constrained complete-link (CCL) [5], pairwise constraints are used to modify the instance proximity computed in the original feature space. Then, standard complete-link clustering is applied using the modified proximity matrix. Penalized probabilistic clustering (PPC) [10] uses pairwise constraints as a prior term w.r.t. the cluster labels within the underlying GMM-based model. Clustering configurations not satisfying the constraints have a lower probability. Moreover, HMRF-KMEANS [19] embeds pairwise constraints as correlations between cluster labels in a hidden Markov random field (HMRF). A metric learning step is added into standard K-means to encourage gradual satisfaction of pairwise constraints. Other methods propagate pairwise constraints via instance similarity to obtain soft constraints, such as constrained spectral clustering [8] and HMRF-pc [12], [20].
Cardinality Constraints. They are widely used to guide the clustering process. Balancing constraints are a special type that encourages all clusters to be balanced in size or in connecting weights. For example, normalized cuts [16] divides the standard cut cost (sum of edge weights connecting the two clusters) by the sum of edge weights between each cluster and all other instances. Hence, each cluster is encouraged to have similar edge weights connecting to other clusters. Similarly, ratio cut [21] normalizes the cut function by the size of each cluster to encourage similar sized clusters. Equi-sized Fuzzy c-means (FCM) [15] formulates the balancing constraints as equality constraints, where the size of each cluster equals the average cluster size. More general cardinality constraints have also been explored. For example, a constrained K-means method [7] sets a lower bound on the cluster size, to avoid very small or empty clusters that occur in standard K-means. An extension of the Equi-sized FCM is proposed in [14], where the size of one single cluster is set to a specific size.
To the best of our knowledge, the only clustering frameworks that enable both sets of constraints (cardinality and pairwise) either target a very specific class of methods that suffer from the locality property [22], or are greedy heuristics that propagate constraints [23]. Clustering methods that suffer from the locality property result in clusters located partially or entirely outside the Voronoi cell of their centers [22]. However, popular methods like K-means, K-medians, and many others always satisfy the locality property by definition, thus limiting the theoretical results of [22] to a smaller class of clustering methods. There has also been an attempt to use standard constraint propagation methods to enforce both classes of constraints [23]. However, this is done in a greedy heuristic fashion that may often fail to find a feasible solution. Therefore, we believe that the combination of both pairwise and cardinality categories into a constraint-generic, unified clustering model that can be systematically solved (i.e. using a flexible continuous optimization framework) has not been explored in any existing work.

III. PROPOSED METHOD
Unlike previous methods that can only handle either pairwise or cardinality constraints, we show, in this section, a detailed derivation of our framework that embeds both constraints simultaneously. In fact, this formulation is flexible and generic enough to handle any other linear equality constraint. In our framework, we adopt the K-means integer program formulation [24], referred to as Eq. (1), where $s_p \in \mathbb{R}^d$ is the $p$-th data point to be clustered and $k$ is the number of clusters. The variable $x_{ij}$ defines the binary association between data point $i$ and cluster $j$. The constraint $\sum_{j=1}^{k} x_{ij} = 1$ enforces data point $i$ to belong to one and only one cluster. This constraint can be simply written as a matrix-vector multiplication: $\Psi^\top x = 1_n$, where $\Psi^\top \in \mathbb{R}^{n \times nk}$ is a binary matrix that has in each row a vector $1_k^\top$ that sums all the binary labels for a given data point while the rest are 0.
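The display form of the integer program referred to above is not reproduced in this text. A plausible reconstruction, assuming the standard K-means integer program with a fractional (centroid) objective and the assignment constraint described in the paragraph, would read as follows; the exact notation in [24] may differ.

```latex
\begin{equation*}
\min_{x} \; \sum_{j=1}^{k} \sum_{p=1}^{n} x_{pj}
\left\| s_p - \frac{\sum_{l=1}^{n} x_{lj}\, s_l}{\sum_{l=1}^{n} x_{lj}} \right\|_2^2
\quad \text{s.t.} \quad
\sum_{j=1}^{k} x_{ij} = 1 \;\; \forall i, \qquad
x_{ij} \in \{0, 1\} \;\; \forall i, j .
\end{equation*}
```

Under this assumption, the ratio inside the norm is the centroid of cluster $j$, and the constraint is the per-point form of $\Psi^\top x = 1_n$.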
To simplify the fractional objective, we introduce the variable $w_{pj}$, such that $x_{pj} = w_{pj} \sum_{l=1}^{n} x_{lj}$. For ease of notation, we concatenate all the binary labels $x_{ij}$ into one vector ordered by the data points one at a time: $x = [(x_{11} \cdots x_{1k}) \cdots (x_{n1} \cdots x_{nk})]^\top$. We also concatenate and reorder the $w_{pj}$ values one cluster at a time: $w = [(w_{11} \cdots w_{n1}) \cdots (w_{1k} \cdots w_{nk})]^\top$. A matrix $P \in \mathbb{R}^{nk \times nk}$ is used to swap the order of the binary vectors from a cluster-based order to a data-point order and vice versa. Note that $P$ is a proper permutation matrix that is symmetric and it satisfies $PP^\top = I_{nk}$. Thus, the compact form of unconstrained K-means can be re-written as Eq. (2), where $C \in \mathbb{R}^{nk \times nk}$ sums the binary labels of each cluster.
Cardinality Constraints. They are enforced by a set of linear constraints that specify the cluster sizes as $Qx = u$ (Eq. (3)), where $u_j$ is the size of cluster $j$ and $Q \in \mathbb{R}^{k \times nk}$ sums the binary labels of each cluster for all data points.
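The substitution $w_{pj}$ and the two vector orderings described above can be checked numerically. The following is a minimal sketch, not the authors' implementation (which is in MATLAB); the function names and the toy data are hypothetical, and the reordering is done with an index array rather than an explicit matrix $P$.

```python
import numpy as np

def weights_from_assignment(X):
    """X: (n, k) binary matrix with one 1 per row and no empty cluster.
    Returns W with W[p, j] = X[p, j] / (size of cluster j), so that
    x_pj = w_pj * sum_l x_lj holds elementwise."""
    sizes = X.sum(axis=0)
    return X / sizes

def kmeans_objective(S, X):
    """S: (d, n) data points in columns. Sum of squared distances of each
    point to the centroid of its assigned cluster."""
    W = weights_from_assignment(X)
    centroids = S @ W            # (d, k): column j is the mean of cluster j
    assigned = centroids @ X.T   # (d, n): centroid of each point's cluster
    return float(((S - assigned) ** 2).sum())

def cluster_to_point_order(n, k):
    """Index array idx such that w_cluster_order[idx] is in data-point order."""
    return np.arange(n * k).reshape(k, n).T.reshape(-1)

# Toy data (hypothetical): 4 one-dimensional points, 2 clusters.
S = np.array([[0.0, 2.0, 10.0, 12.0]])
X = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
n, k = X.shape
x = X.reshape(-1)                               # data-point order
w = weights_from_assignment(X).T.reshape(-1)    # cluster order
idx = cluster_to_point_order(n, k)
sizes_per_entry = np.tile(X.sum(axis=0), n)     # plays the role of Cx
```

With these definitions, `w[idx] * sizes_per_entry` reproduces `x`, which is the vectorized form of $x_{pj} = w_{pj} \sum_l x_{lj}$.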
Must-Link Constraints. We define $E_1, E_2 \in \mathbb{R}^{kv \times nk}$ as selection matrices that choose the two sets of data points ($E_1 x$ and $E_2 x$) involved in the $v$ must-link constraints. We show next that the set of all must-link constraints can be expressed with a single quadratic constraint.
Proposition 1. For the binary association $x \in \{0, 1\}^{nk}$ between $n$ data points and $k$ clusters, where $x$ is binary and each data point is associated with only one cluster (i.e. $\Psi^\top x = 1_n$), all must-link constraints are expressed by the single quadratic constraint of Eq. (4).
This shows that only one quadratic constraint is needed to enforce all must-link constraints.
Cannot-Link Constraints. We define $E_3, E_4 \in \mathbb{R}^{ke \times nk}$ as selection matrices for the two sets of data points ($E_3 x$ and $E_4 x$) involved in the $e$ cannot-link constraints. Similar to before, we show that the set of cannot-link constraints can be expressed with a single quadratic constraint.
Proposition 2. For the binary association $x \in \{0, 1\}^{nk}$ between $n$ data points and $k$ clusters, where $\Psi^\top x = 1_n$, all cannot-link constraints are expressed by the single quadratic constraint of Eq. (5).
Incorporating Eqs. (3), (4) and (5) into (2), we obtain the constrained K-means formulation of Eq. (6), where $S \in \mathbb{R}^{d \times n}$ contains all the data points in its columns and $\Lambda_j \in \mathbb{R}^{n \times nk}$ is zero everywhere except for the $j$-th block, which is identity, i.e. $\Lambda_j = [0 \cdots I_j \cdots 0]$.
ADMM Solver. Problem (6) is still difficult to solve due to the mixed binary and quadratic constraints. To handle these difficulties, (i) we first replace the binary constraints with an exactly equivalent set that is the intersection of the $\ell_2$-sphere (defined by set $S_2$) and box constraints (defined by set $S_b$), following [17]. (ii) Moreover, by introducing the auxiliary variables ($z_1$, $z_2$, $z_3$, and $z_4$), the quadratic constraints are changed to bi-linear ones and separated from the binary constraints. The resultant problem, Eq. (7), can be solved in the standard ADMM framework.
Algorithm 1: ADMM for Solving Problem (7)
Input: Set $S \in \mathbb{R}^{d \times n}$. Set $\rho_{1-9}$, $y_{1-5,7,9} = 0$, $y_{6,8} = 0$, $x_{kmeans}$, $w = P^\top x \odot \mathrm{diag}^{-1}(Cx) 1_{nk}$.
Output: $x$
while not converged do
    update $x$ by solving Eq. (9).
    update $w$ by solving Eq. (10).
    update $z_{1-4}$ via Eqs. (11), (12), (13), (14).
    update $y_{1-5,7,9}$, $y_{6,8}$ via Eq. (15).
end
Let $L_{\rho_{1-9}}$ be the augmented Lagrangian function of problem (7), defined in Eq. (8), where the $y$ variables are the Lagrange multipliers of the corresponding constraints, I is the indicator function that penalizes infeasible $z_1$ and $z_2$, and $\rho_{1-9} \ge 0$ are the penalty parameters. In our experiments, we set all the $\rho$ coefficients to the same value. The iterative ADMM steps for problem (7) are described in Algorithm 1. ADMM updates are performed by optimizing for the set of primal variables one at a time, while keeping the rest of the primal and dual variables fixed. Then, the dual variables are updated using gradient ascent on the corresponding dual problem.
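The idea that all must-link (or all cannot-link) constraints collapse into one quadratic expression in $x$ can be illustrated numerically. The sketch below assumes that the quadratic form is the inner product $(E_1 x)^\top (E_2 x)$, which counts the constrained pairs sharing a cluster, so that must-link holds when it equals $v$ and cannot-link holds when $(E_3 x)^\top (E_4 x) = 0$. This form is an assumption consistent with the surrounding definitions, and the function names are hypothetical. The selection matrices are replaced by row indexing.

```python
import numpy as np

def pairwise_inner_product(X, pairs):
    """X: (n, k) binary assignment matrix with one 1 per row.
    pairs: list of (a, b) data-point index pairs.
    Returns the number of pairs whose two points share a cluster, i.e. the
    inner product of the two stacked one-hot label selections."""
    a = np.array([p[0] for p in pairs], dtype=int)
    b = np.array([p[1] for p in pairs], dtype=int)
    return int((X[a] * X[b]).sum())

def must_link_satisfied(X, pairs):
    # Assumed form: every one of the v pairs agrees, so the count equals v.
    return pairwise_inner_product(X, pairs) == len(pairs)

def cannot_link_satisfied(X, pairs):
    # Assumed form: no pair agrees, so the count is zero.
    return pairwise_inner_product(X, pairs) == 0

# Toy labeling (hypothetical): 5 points, 3 clusters.
labels = np.array([0, 0, 1, 1, 2])
X = np.eye(3)[labels]
```

Because each row of `X` is one-hot, the product of two rows sums to 1 exactly when the two points carry the same label, which is why a single scalar equation can cover the whole constraint set.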
We next show the final updates for each subproblem, but the exact derivations are found in the supplementary material.
a: Update x: We need to solve the following linear system using the conjugate gradient method.
b: Update w: We need to solve the following linear system using the conjugate gradient method.
c: Update z1: Here, we need to perform a simple projection onto the box. This projection is an elementwise clamping between 0 and +1.
d: Update z2: We need to perform a simple projection onto the $\ell_2$-sphere. This involves an elementwise shift and $\ell_2$ vector normalization.
e: Update z3: We need to solve the following linear system using the conjugate gradient method.
f: Update z4: We need to solve the following linear system using the conjugate gradient method.
g: Update y1, y2, y3, y4, y5, y6, y7, y8, y9: Lastly, we need to perform dual ascent on the dual variables, as given in Eq. (15).
The ADMM iterations are run until convergence (i.e. when the standard deviation of the last 10 objective values is $\le 10^{-5}$). Upon convergence, all the primal variables ($x$ and $z_{1-4}$) converge to the same feasible binary vector. Although the problem is non-convex, we show empirically in the experiments section and in the supplementary material that the performance is very stable.
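The stopping rule stated above (standard deviation of the last 10 objective values at most $10^{-5}$) can be sketched as follows. The function name and the use of the population standard deviation are assumptions; the text does not say which estimator is used.

```python
import numpy as np

def has_converged(objective_history, window=10, tol=1e-5):
    """True once the last `window` objective values have a standard
    deviation of at most `tol`. Returns False while fewer than `window`
    values are available."""
    if len(objective_history) < window:
        return False
    recent = np.asarray(objective_history[-window:], dtype=float)
    return float(np.std(recent)) <= tol
```

In an ADMM loop, the objective value would be appended to the history after each iteration and the loop would exit once this check returns True.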

IV. EXPERIMENTS
In this section, we conduct extensive experiments to motivate and evaluate our proposed clustering method, both on synthetic and real datasets. We also compare our method against other constrained clustering methods on well-known benchmarks, demonstrating superior performance and flexibility, as well as the additional gain that can be achieved when both categories of constraints (cardinality and pairwise) are combined in our framework.
1. Datasets and Implementation Details. The datasets used in this section vary from synthetic to real. As for the synthetic ones, we construct two datasets: one is cluster balanced (denoted as Balanced) and the other is imbalanced (denoted as ImBalanced), as shown in Figure 1. Each dataset comprises 700 data points with 2 clusters. In Balanced, each cluster has exactly 350 data points, while in ImBalanced one cluster has 600 data points while the other contains 100. As for the real datasets, we make use of various popular UCI datasets [7], e.g. iris, wine, glass, ionosphere, Hepatitis, Hepatitis1 and Breast Cancer Wis-D. These are the most popular UCI datasets used for clustering purposes [8], [13]. Following convention, data points are normalized to have values in [−1, +1]. For Hepatitis and Hepatitis1, we remove all points with missing or non-categorical features. Table 1 lists the details of all UCI datasets used in the experiments.
As for the implementation details, none of the selection matrices used in the proposed framework (i.e. $E_1$, $E_2$, $E_3$, $E_4$, $P$, $C$, $Q$, $\Psi$, $\Lambda_j$) are actually constructed. Only element indexing within vectors is used, thus keeping the necessary computation cost minimal. For ease, all $\rho_i$ parameters have the same value and are updated similarly. We find that setting all $\rho_i$ parameters to 20 and increasing them by 10% every 5 iterations achieves the fastest convergence for all real datasets. Moreover, we initialize all the optimization variables using zero vectors, while $x$ is initialized randomly (i.e. random assignment of data points to clusters) if the comparison is against K-means. When comparing against other clustering methods, we use the same K-means initialization as other methods. In all comparisons, $w$ is initialized to a feasible point as given in Algorithm 1. MATLAB is used to implement our method. The most expensive operations in our framework are the $x$ and $w$ updates, which involve solving a linear system in $nk$ unknowns. This is the bottleneck of our framework, causing it to have a computational complexity of $O(n^3 k^3)$ per iteration. In the final experiment, we report the runtime of our framework on different sized datasets with a variety of constraint choices.
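The penalty schedule mentioned above (start at 20, grow by 10% every 5 iterations) can be written as a one-line rule. This is a sketch under two assumptions that the text does not settle: the increase is multiplicative, and the first increase happens at the sixth iteration (0-based index 5). The function name is hypothetical.

```python
def rho_schedule(iteration, rho0=20.0, growth=1.10, every=5):
    """Penalty value used at a given 0-based ADMM iteration.
    The value stays constant within each block of `every` iterations and is
    multiplied by `growth` at the start of the next block."""
    return rho0 * growth ** (iteration // every)
```

For example, iterations 0 to 4 would use 20, iterations 5 to 9 would use 22, and so on; the same value would be shared by all nine penalty parameters.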
As for the evaluation metrics, we adopt the 3 most common criteria used in the clustering community to compare different clustering methods, namely the Adjusted Rand Index (ARI) (↗), Mirkin's Index (MI) (↘) and Hubert's Index (HI) (↗), which measure the agreement between two partitions of a dataset [25], [26]. The symbol ↗ indicates that a higher number means better performance, and vice versa for ↘. In all experiments, clustering is repeated 10 times with different initializations and we report the average and standard deviation of the metric used in comparison.
2. Comparing Different Constraint Design Choices. We apply our proposed method, p Km, on the same clustering task with several choices of constraints: no constraints, only cardinality constraints, only pairwise constraints and both types jointly. We refer to these as p Km, p Km-Car, p Km-Pair and p Km-Mix respectively.
(i) An Auxiliary Experiment. Although we do not provide a convergence proof for the non-quadratic objective in Eq. (1), as is proven for the quadratic case in [17], we find the performance very stable and demonstrate this empirically. For instance, we run p Km-Mix enforcing cardinality, 20 must-link and 20 cannot-link constraints. In Figure 2, we plot the three pieces of the solution label vector $x$ at four different ADMM iterations (1, 15, 25, and 200). In the first iteration, the initial clustering is random but satisfies the cardinality constraints, so it is binary but does not lead to a good objective. As ADMM progresses, the continuous solution $x$ becomes more and more binary, until it converges to a feasible binary solution where the three clusters are disjoint and satisfy all constraints. Moreover, we also report the number of cardinality (CardV), must-link (MLV), and cannot-link (CLV) violations at each of these iterations. These violations gradually decrease until convergence occurs, when no violations persist. We find this stable performance across all datasets, as will be presented in later sections. Further detailed experiments can be found in the supplementary material.
(i) Traditional K-means versus p Km.
First, we start by comparing our vanilla constraint-free clustering method p Km against K-means. We show that p Km can in fact attain very similar, if not better, performance than traditional K-means (built-in MATLAB function). This is because both methods use the same objective and p Km does converge to good solutions. Experiments are conducted on some of the UCI datasets [7] (wine, iris and glass). Table 2 reports the K-means objective value, ARI(%), MI and HI(%) metrics for both methods.
(ii) Traditional K-means versus p Km-Car. Here, we demonstrate that our framework coupled with only cardinality constraints outperforms traditional unconstrained K-means on a variety of synthetic data. This highlights the importance of having this prior information available and harnessing it in the clustering process. In these experiments, the cardinality constraints are generated from ground truth labels. To show that cardinality does in fact help clustering performance, we apply p Km-Car on the two synthetic datasets (Balanced and ImBalanced) and report their ARI, MI and HI results in Table 3.
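The three agreement metrics reported in these tables can all be derived from pair counts between two labelings. The sketch below uses the standard ARI formula and assumes the common normalized pair-counting definitions for the other two: MI as the fraction of disagreeing pairs and HI as the fraction of agreeing pairs minus the fraction of disagreeing pairs. Whether the experiments use exactly these normalizations is an assumption, and the function name is hypothetical.

```python
from collections import Counter
from math import comb

def pair_counting_indices(labels_a, labels_b):
    """Return (ARI, MI, HI) for two labelings of the same n >= 2 items."""
    n = len(labels_a)
    total = comb(n, 2)
    # Pairs placed together in both labelings, in the first, and in the second.
    same_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    same_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    same_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = same_a * same_b / total
    max_index = 0.5 * (same_a + same_b)
    if max_index == expected:
        ari = 1.0  # degenerate case, e.g. both labelings use a single cluster
    else:
        ari = (same_both - expected) / (max_index - expected)
    agree = total + 2 * same_both - same_a - same_b  # together-together plus apart-apart
    disagree = total - agree
    return ari, disagree / total, (agree - disagree) / total

ari_same, mi_same, hi_same = pair_counting_indices([0, 0, 1, 1], [1, 1, 0, 0])
ari_diff, mi_diff, hi_diff = pair_counting_indices([0, 0, 1, 1], [0, 1, 0, 1])
```

The first call compares two labelings that differ only by a renaming of the clusters, so all three metrics take their best values; the second compares two labelings that split every pair differently.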
For the Balanced dataset, the separation between the four groups of points increases. In fact, K-means tends to cluster points together such that each cluster has a similar variance as other clusters. Consequently, K-means clusters the high-density points of the Balanced dataset together and groups the remaining less dense points into another cluster. In comparison, our framework exploits the cardinality constraints to achieve perfect clustering performance. Similarly, the ImBalanced dataset contains two imbalanced clusters with very different densities, where the separation between them is increased. In this case, K-means often mixes data points between clusters, since the cardinality constraints are not used. On the other hand, p Km-Car can almost perfectly predict the ground truth clustering labels. Interestingly, the variance of our results is much lower than that of K-means even though they both use the same clustering initialization. This indicates that the cardinality constraints afford our method robustness to the initialization.
To the best of our knowledge, none of the previous works that handle generic cardinality constraints have readily available code for comparison. Therefore, we only compare our method with traditional unconstrained K-means, so as to demonstrate the effectiveness of adding cardinality constraints to an unconstrained clustering method.
(iii) Pairwise Constrained Clustering Methods versus p Km-Pair. Here, we compare our p Km-Pair method against several pairwise constrained methods from the literature, namely Constrained Clustering [13] (COP-KMEANS), Spectral Clustering [8], Penalized Probabilistic Clustering (PPC) [10] and CCL [11]. All pairwise constraints were randomly generated. Among these methods, only COP-KMEANS [13] and p Km-Pair exactly enforce the constraints, while the others incorporate them as soft pairwise constraints in their clustering framework. Consequently, Spectral Clustering, PPC and CCL may result in constraint violations. However, due to the heuristic nature of COP-KMEANS, depending on the initialization it may fail to attain a feasible solution. We run all five methods on two UCI datasets (wine, iris) and ensure that all methods receive the same randomly generated pairwise constraints. Tables 4 and 5 report the performance of these methods on all discussed metrics. For each experiment, we also report the number of must-link (ml) and cannot-link (cl) constraints, as well as the number of must-link violations (MLV) and cannot-link violations (CLV). It is clear that p Km-Pair outperforms all other methods, while satisfying all the constraints.
(iv) K-means versus p Km-Mix. Here, we demonstrate the main motivation behind our flexible framework, namely its ability to incorporate both cardinality and pairwise constraints simultaneously in the clustering optimization. Firstly, and following previous work [13], we demonstrate that increasing the number of pairwise constraints (either must-link or cannot-link) with the same cardinality constraints consistently improves performance. We conduct this experiment on two different datasets: one synthetic (ImBalanced y=0.1) and one real (wine). Figure 3 compares our method against traditional K-means in this setup. As expected, K-means does not benefit from the constraints while ours consistently improves in performance. Secondly, we compare all three variants of our framework, i.e. cardinality only constraints (p Km-Car), pairwise only constraints (p Km-Pair) and both (p Km-Mix), on several UCI datasets (ionosphere, Hepatitis, Hepatitis1 and Breast Cancer Wis-D). The numbers of must-link and cannot-link constraints were equal for each dataset and were set proportional to the dataset size, to (20, 25, 20, 100) respectively. Results in Table 6 show that our method performs increasingly and significantly better when more constraint categories are used simultaneously. This improvement reaches as high as 40% in ARI for some datasets. We also report in Table 6 the runtime for all 3 variants on all 4 datasets. The runtime varies from 0.5 to 10 seconds depending on the dataset size and the number of clusters. We did not compare against other methods here, since there is no existing work that combines both categories of constraints in a unified framework and extending the pairwise constrained methods to cardinality constraints is not trivial.
VOLUME 4, 2016
TABLE 4. Comparison of several pairwise constrained clustering methods against p Km-Pair using ARI(%), as well as must-link (MLV) and cannot-link (CLV) violations in the constraints. The cells indicated with x imply that the underlying method does not attain a feasible solution after 1000 runs.

V. CONCLUSION
We proposed a new flexible framework to handle both pairwise and cardinality constraints for K-means clustering. The resulting integer program is transformed into an equivalent continuous reformulation where pairwise constraints are incorporated as quadratic constraints. The resultant problem is solved using ADMM. Extensive experiments have been conducted on both synthetic and real datasets to demonstrate the competitive performance of our method under different constraint choices, and that the proposed method achieves state-of-the-art performance when both types of constraints are used simultaneously. As future work, we seek to adapt our work to deep clustering approaches [27]-[29], which jointly learn feature representations and cluster assignments. That is to say, the ADMM updates will involve another minimization step that trains the encoder for the feature learning procedure, while the other steps solve for the assignments and impose the constraints. This should have applications in biology, medicine, finance, and animation: in many such applications, pseudo labels in terms of assignments are provided as constraints (e.g., two patients are labeled as having a similar syndrome), and feature learning will allow for further improvements in clustering performance.

VI. APPENDIX
Reformulation of Constrained K-means.
We start with our formulation of constrained K-means in Equation (1) below (or Equation (4) in the manuscript).
We define $X \in \mathbb{R}^{k \times n}$ and therefore $x = \mathrm{vect}(X)$. The vect operator simply vectorizes the matrix one column at a time (i.e. one data point at a time). The order of concatenation is reversed for the vector $w = \mathrm{vect}(W)$, where $W \in \mathbb{R}^{n \times k}$ (i.e. the vectorization is done one cluster at a time). Therefore, it is important to point out that the order of the binary labels $x$ and that of $w$ is swapped. Reordering these vectors based on data points or clusters is controlled by the permutation matrix $P \in \mathbb{R}^{nk \times nk}$. Therefore, $Px = P^\top x = \mathrm{vect}(X^\top)$ and $Pw = P^\top w = \mathrm{vect}(W^\top)$. Throughout the main manuscript and the appendix, we use $P$ and $P^\top$ to change the order of $w$ to the same order as $x$ and vice versa. Matrix $S \in \mathbb{R}^{d \times n}$ in Equation (1) simply concatenates all the data points in its columns, and $\Lambda_j \in \mathbb{R}^{n \times nk}$ is zero everywhere except for the $j$-th block that is identity, i.e. $\Lambda_j = [0, \ldots, I_j, \ldots, 0]$. As for $\Psi^\top \in \mathbb{R}^{n \times nk}$, it is a binary matrix that has in each row a vector $1_k^\top$ that sums all the binary labels for a given data point while the rest are zeros. We write this matrix in blockwise form as follows: As for $Q \in \mathbb{R}^{k \times nk}$, it sums the binary labels of each cluster, one cluster at a time, over all the data points and its blockwise matrix form is given as follows: As for $C \in \mathbb{R}^{nk \times nk}$, it sums the binary labels for each cluster, one cluster at a time, and its blockwise matrix form is as follows: As for the box and $\ell_2$-sphere constraints (whose intersection is the binary vector space), we define two sets: Lastly, $E_1, E_2 \in \mathbb{R}^{kv \times nk}$ and $E_3, E_4 \in \mathbb{R}^{ke \times nk}$ are selection matrices for the must-link and cannot-link constraints respectively. They select the data points that are involved in both types of constraints.
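The verbal descriptions of $\Psi^\top$, $Q$ and $C$ above admit compact Kronecker-product forms when $x$ is in data-point order. The sketch below builds them explicitly for a toy problem; the Kronecker forms are assumptions derived from those descriptions rather than the blockwise forms of the original display equations, and the function name is hypothetical. The experiments section notes that the actual implementation never constructs these matrices and uses indexing instead.

```python
import numpy as np

def build_operators(n, k):
    """Dense versions of the summation operators for x in data-point order.
    Psi_T (n, nk): sums the k labels of each data point.
    Q     (k, nk): sums the labels of each cluster over all data points.
    C     (nk, nk): replaces every entry (i, j) by the size of cluster j."""
    Psi_T = np.kron(np.eye(n), np.ones((1, k)))
    Q = np.kron(np.ones((1, n)), np.eye(k))
    C = np.kron(np.ones((n, n)), np.eye(k))
    return Psi_T, Q, C

# Toy labeling (hypothetical): 6 points in 3 clusters of sizes 2, 1 and 3.
labels = np.array([0, 0, 1, 2, 2, 2])
n, k = 6, 3
X = np.eye(k)[labels]     # (n, k) one-hot rows
x = X.reshape(-1)         # data-point order
Psi_T, Q, C = build_operators(n, k)
sizes = Q @ x             # cluster sizes, the vector u in a cardinality constraint
```

Under these forms, `Psi_T @ x` is the all-ones vector, `Q @ x` gives the cluster sizes, and `C` equals `Q.T @ Q`.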
Applying ADMM. Following the conventional treatment of an optimization problem using ADMM, we first formulate the augmented Lagrangian function $\mathcal{L}(x, w, z_1, z_2, z_3, z_4, y_1, y_2, \dots, y_9)$. The ADMM update steps update each primal variable ($x$, $w$, and $z_{1-4}$) sequentially, while keeping the rest of these variables and the dual variables ($y_1, \dots, y_9$) set to their most recent values. After the primal variables are updated, the dual variables are updated via a single gradient ascent step. Next, we detail each update step and the underlying optimization sub-problem that needs to be solved.
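The update pattern (sequential primal updates followed by one dual ascent step) can be sketched on a much simpler problem than the one treated here. The toy problem below, minimizing $\frac{1}{2}\|x - c\|^2$ subject to $x = z$ with $z$ in the unit box, is an assumption chosen purely for illustration; the function name `admm_box_toy`, the penalty `rho`, and the iteration count are hypothetical and do not come from the manuscript.

```python
import numpy as np


def admm_box_toy(c, rho=1.0, iters=200):
    """Toy ADMM: minimize 0.5*||x - c||^2 subject to x = z, z in [0, 1]^n."""
    x = np.zeros_like(c)
    z = np.zeros_like(c)
    y = np.zeros_like(c)  # dual variable for the constraint x = z
    for _ in range(iters):
        # Primal update 1: strongly convex quadratic in x, solved in closed
        # form by setting the gradient (x - c) + y + rho*(x - z) to zero.
        x = (c - y + rho * z) / (1.0 + rho)
        # Primal update 2: projection onto the box.
        z = np.clip(x + y / rho, 0.0, 1.0)
        # Dual update: a single gradient ascent step on the residual x - z.
        y = y + rho * (x - z)
    return x, z, y


c = np.array([1.5, 0.3, -0.7])
x, z, y = admm_box_toy(c)
print(np.round(x, 4), np.round(z, 4))
```

On this convex toy problem the iterates approach the projection of `c` onto the box. The problem treated in this appendix has more blocks and non-convex constraints, so this sketch only conveys the order of the updates.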

Update $x$: The resulting subproblem is strongly convex quadratic in $x$. Therefore, a stationary point is necessary and sufficient for optimality, and the update is obtained by equating the gradient to zero.
Update $w$: Similarly to the $x$-update, the problem is strongly convex quadratic in $w$, and finding a stationary point is necessary and sufficient for a global solution. In this derivation, we use the fact that $\nabla_w\, y_5^\top (Pw \odot Cx) = P^\top (Cx \odot y_5)$. To see why this is the case, note that $y_5^\top (Pw \odot Cx) = (Cx \odot y_5)^\top P w$, which is linear in $w$.
Update $z_1$: Here, we need to perform a simple projection onto the box set $S_b$. The projection $P_{S_b}(\cdot)$ is an elementwise clamping between 0 and +1, that is, $P_{S_b}(a) = \min(\max(a, 0), 1)$ for a scalar value $a$.
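The gradient identity used in the $w$-update can be checked numerically. The sketch below compares $P^\top (Cx \odot y_5)$ against central finite differences of $y_5^\top (Pw \odot Cx)$; the random matrices are stand-ins for $P$, $C$, $x$, $w$, and $y_5$ (assumptions for illustration, not the structured matrices of the formulation), and `m` plays the role of $nk$.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 6  # stands in for nk

P = np.eye(m)[rng.permutation(m)]  # a random permutation matrix
C = rng.standard_normal((m, m))    # stand-in for the matrix C
x = rng.standard_normal(m)
w = rng.standard_normal(m)
y5 = rng.standard_normal(m)


def f(w_vec):
    """The scalar term y5^T (P w ⊙ C x) as a function of w."""
    return y5 @ ((P @ w_vec) * (C @ x))


# Closed-form gradient claimed in the text.
analytic = P.T @ ((C @ x) * y5)

# Central finite differences along each coordinate direction.
eps = 1e-6
numeric = np.array(
    [(f(w + eps * e) - f(w - eps * e)) / (2 * eps) for e in np.eye(m)]
)

print(np.allclose(analytic, numeric, atol=1e-6))
```

Because the term is linear in $w$, the finite-difference estimate agrees with the closed form up to floating-point rounding.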
Update $z_2$: We need to perform a simple projection onto the $\ell_2$-sphere. For any vector $a \in \mathbb{R}^n$, the projection $P_{S_2}(\cdot)$ involves an elementwise shift and an $\ell_2$ vector normalization.
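Both projections are cheap to implement. In the sketch below, the box projection follows the clamping formula stated above, while the sphere projection assumes the sphere is centered at $\frac{1}{2}\mathbf{1}$ with radius $\sqrt{n}/2$; this is a common choice that is consistent with "an elementwise shift and normalization", but it is an assumption here rather than a definition recovered from the manuscript. The function names are hypothetical.

```python
import numpy as np


def project_box(a):
    """Elementwise clamping to [0, 1]."""
    return np.minimum(np.maximum(a, 0.0), 1.0)


def project_sphere(a):
    """Project onto {v : ||v - 0.5||^2 = n/4} (assumed form of the sphere)."""
    a = np.asarray(a, dtype=float)
    shifted = a - 0.5
    norm = np.linalg.norm(shifted)
    if norm == 0.0:
        # The center is equidistant from the whole sphere; pick one point.
        return np.ones_like(a)
    return 0.5 + (np.sqrt(a.size) / 2.0) * shifted / norm


a = np.array([1.4, -0.3, 0.6, 0.2])
zb = project_box(a)
zs = project_sphere(a)
print(zb)
print(np.linalg.norm(zs - 0.5) ** 2, a.size / 4)

# Binary vectors lie in the box and on the assumed sphere, so both
# projections leave them unchanged.
b = np.array([1.0, 0.0, 0.0, 1.0])
print(np.allclose(project_box(b), b), np.allclose(project_sphere(b), b))
```

Under this assumed sphere, a vector lies in both sets only if every entry is 0 or 1, which is the sense in which the two continuous constraints together replace the binary constraint.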
Update $z_3$: The problem is strongly convex quadratic in $z_3$, so we obtain the unique global minimizer by equating the gradient to zero.
Update $z_4$: The problem is also strongly convex quadratic in $z_4$, so we obtain the unique global minimizer by equating the gradient to zero as follows:

$$\left(\rho_8 E_3^\top E_4 x x^\top E_4^\top E_3 + \rho_9 I_{nk}\right) z_4 = y_9 + \rho_9 x - y_8 E_3^\top E_4 x$$

Update $y_1, y_2, \dots, y_9$: Lastly, we perform dual gradient ascent to update the dual variables.
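The linear system in the last primal update has the form (rank-one matrix plus scaled identity), with $u = E_3^\top E_4 x$. One way to solve such a system without forming or factorizing an $nk \times nk$ matrix is the Sherman-Morrison formula. The sketch below is an illustrative assumption and not necessarily how the authors solve it; the helper name is hypothetical, `u` is a random stand-in for $E_3^\top E_4 x$, and $y_8$ is assumed to be a scalar.

```python
import numpy as np


def solve_rank_one_plus_identity(u, b, rho_a, rho_b):
    """Solve (rho_a * u u^T + rho_b * I) z = b via Sherman-Morrison."""
    scale = rho_a * (u @ b) / (rho_b * (rho_b + rho_a * (u @ u)))
    return b / rho_b - scale * u


rng = np.random.default_rng(1)
m = 8                        # stands in for nk
u = rng.standard_normal(m)   # stand-in for E_3^T E_4 x
x = rng.standard_normal(m)
y9 = rng.standard_normal(m)
y8 = 0.7                     # assumed scalar dual variable
rho8, rho9 = 2.0, 5.0        # hypothetical penalty parameters

b = y9 + rho9 * x - y8 * u   # right-hand side of the system
z_fast = solve_rank_one_plus_identity(u, b, rho8, rho9)

# Reference solution with a dense solve, for comparison only.
A = rho8 * np.outer(u, u) + rho9 * np.eye(m)
z_ref = np.linalg.solve(A, b)
print(np.allclose(z_fast, z_ref))
```

The closed form costs a few inner products per update, whereas a dense solve scales cubically with $nk$.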

VII. AUXILIARY RESULTS
We present some additional experimental results that augment the discussion made in the manuscript. Primarily, we provide empirical evidence that our FCKm method and its constrained variants converge to binary solutions that satisfy different constraints (pairwise and cardinality).
Convergence for Unconstrained Clustering. In Figure 4, we plot the K-means objective value at each ADMM iteration for an unconstrained clustering task with three clusters. Note that the initialization to this problem is a random three-way clustering. The objective decreases monotonically (after the first 2-3 iterations) and converges to a minimum value in approximately 6000 iterations. The optimization is stable, with no perturbations at the onset of convergence.
In Figure 5, we plot the cluster vector for each of the three clusters being optimized at different iterations (1, 500, 2500, and 6530), i.e., we plot the three pieces of the label vector x. In the first iteration, the initial clustering is done randomly, so it is binary but does not lead to a good objective. As ADMM progresses, the continuous solution x becomes more and more binary, until it converges to a feasible binary solution where the three clusters are completely disjoint.
Convergence for Constrained Clustering. In Figure 6, we plot the K-means objective value at each ADMM iteration for a constrained clustering task with three clusters. In this task, we enforce cardinality, must-link, and cannot-link constraints on the optimization. The initialization is taken to be a random assignment between the three disjoint clusters. In this case, the objective tends to increase monotonically after the first few iterations. This might seem counter-intuitive, since we are trying to minimize the objective. However, the continuous solution vector x in each iteration tends not to be feasible with respect to the enforced constraints. In other words, these constraints are enforced more and more strictly as ADMM proceeds, which forces a trade-off between objective and feasibility. Still, as in the unconstrained case, the variation in the objective is smooth and no perturbations are exhibited when ADMM begins to converge.
Moreover, Figure 4 plots the solution vectors for each cluster at four different iterations (1, 15, 25, and 200). The behavior is similar to the unconstrained case: the solution converges to disjoint binary vectors. The notable difference is that we also report the number of cardinality (CardV), must-link (MLV), and cannot-link (CLV) violations at each of these iterations. These violations gradually decrease until convergence, at which point no violations persist.
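For a hard assignment, the three violation counts can be computed as in the sketch below. The exact counting rules behind CardV, MLV, and CLV are not spelled out here, so the definitions used are assumptions: MLV counts must-link pairs with different labels, CLV counts cannot-link pairs with the same label, and CardV counts clusters whose size differs from its target. The function name and the toy data are hypothetical.

```python
import numpy as np


def count_violations(labels, must_link, cannot_link, target_sizes):
    """Return (CardV, MLV, CLV) for integer labels in 0..k-1 (assumed rules)."""
    labels = np.asarray(labels)
    # Must-link pair violated when the two points get different labels.
    mlv = sum(int(labels[i] != labels[j]) for i, j in must_link)
    # Cannot-link pair violated when the two points share a label.
    clv = sum(int(labels[i] == labels[j]) for i, j in cannot_link)
    # Cardinality violated for each cluster whose size misses its target.
    sizes = np.bincount(labels, minlength=len(target_sizes))
    cardv = int(np.sum(sizes != np.asarray(target_sizes)))
    return cardv, mlv, clv


labels = [0, 0, 1, 2, 2, 1]
must_link = [(0, 1), (2, 3)]
cannot_link = [(0, 2), (3, 4)]
target_sizes = [2, 2, 2]

print(count_violations(labels, must_link, cannot_link, target_sizes))
```

In this toy example the pair (2, 3) breaks a must-link constraint and the pair (3, 4) breaks a cannot-link constraint, while all cluster sizes match their targets.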
Here, vect(B) is simply a columnwise vectorization of the matrix B.

FIGURE 2. Convergence of the solution x using p Km-Mix with a random binary initialization that satisfies the cardinality constraints on the Wine dataset.

ACKNOWLEDGMENT
The work was done when Adel Bibi was at KAUST. This work was supported by the King Abdullah University of Science and Technology (KAUST) Office of Sponsored Research and the Deanship of Scientific Research, King Khalid University of Kingdom of Saudi Arabia under research grant number (RGP1/357/43).

ADEL BIBI is a senior research fellow in machine learning and computer vision at the Department of Engineering Science of the University of Oxford. He is also a Junior Research Fellow (JRF) of Kellogg College. Prior to that, he was a postdoctoral researcher at the same group for a year since October 2020. Adel received his MSc and PhD degrees from King Abdullah University of Science and Technology (KAUST) in 2016 and 2020, respectively. He received his BSc degree in electrical engineering with class honors from Kuwait University in 2014. Adel has been recognized as an outstanding reviewer for CVPR18, CVPR19, ICCV19, and ICLR22, and won the best paper award at the optimization and big data conference in KAUST. He has published more than 20 papers in CVPRs, ECCVs, ICCVs and ICLRs, some of which were selected as orals and spotlights, and is going to serve as an area chair for AAAI23.

ALI ALQAHTANI received his Ph.D. in computer science from Swansea University, Swansea, U.K., in 2021. He is currently an Assistant Professor with the Department of Computer Science, King Khalid University, Abha, KSA. He has published several refereed conference and journal publications. His research interests include various aspects of pattern recognition, deep learning, and machine intelligence and their applications to real-world problems.

BERNARD GHANEM is currently a Professor in the Electrical and Computer Engineering program at King Abdullah University of Science and Technology (KAUST). His research interests lie in computer vision and machine learning with emphasis on topics in video understanding, 3D recognition, and theoretical foundations of deep learning. He received his Bachelor's degree from the American University of Beirut (AUB) in 2005 and his MS/PhD from the University of Illinois at Urbana-Champaign (UIUC) in 2010. His work has received several awards and honors, including several Best Workshop Paper Awards in CVPR and ECCV, a Google Faculty Research Award in 2015 (1st in MENA for Machine Perception), and an Abdul Hameed Shoman Arab Researchers Award for Big Data and Machine Learning in 2020. He has co-authored more than 100 peer-reviewed conference and journal papers in his field as well as three issued patents. He serves as Associate Editor for IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) and has served as Area Chair (AC) for the leading computer vision and machine learning conferences.

FIGURE 4. Convergence of the K-means objective value using FCKm with random initialization on the Wine dataset. Note the decreasing nature of the objective and its smooth convergence to the solution.

FIGURE 5. Convergence of the solution x using FCKm with random initialization on the Wine dataset.

FIGURE 6. Convergence of the K-means objective value using FCKm-Mix with a random initialization on the Wine dataset.

TABLE 1. Total number of points, clusters, and features of all UCI datasets used in the experiments.

FIGURE 1. The Balanced and ImBalanced synthetic datasets, shown in the two consecutive rows respectively. Each comprises two clusters (red/black) with an increasing separation between clusters.

TABLE 2. Comparison between K-means and p Km on real UCI datasets using the K-means objective value, ARI(%), MI, and HI(%), along with the standard deviation.

TABLE 3. Comparison between K-means and p Km-Car on balanced and imbalanced synthetic datasets using ARI(%), MI, and HI(%).

TABLE 5. Comparison of several pairwise constrained clustering methods against p Km-Pair using MI and HI(%). Cells marked with x indicate that the underlying method does not attain a feasible solution after 1000 runs. Effect of increasing must-link and cannot-link constraints separately, as compared to unconstrained K-means.

TABLE 6. Comparison of several pairwise constrained clustering methods against p Km-Pair using MI and HI(%). Cells marked with x indicate that the underlying method does not attain a feasible solution after 1000 runs.