Enhancing Cluster Analysis With Explainable AI and Multidimensional Cluster Prototypes

Explainable Artificial Intelligence (XAI) aims to introduce transparency and intelligibility into the decision-making process of AI systems. Most often, its application concentrates on supervised machine learning problems such as classification and regression. Nevertheless, in the case of unsupervised algorithms like clustering, XAI can also bring satisfactory results. In most cases, such application is based on transforming the unsupervised clustering task into a supervised one and providing either generalised global explanations or local explanations based on cluster centroids. However, in many cases, the global explanations are too coarse, while the centroid-based local explanations lose information about cluster shape and distribution. In this paper, we present a novel approach called ClAMP (Cluster Analysis with Multidimensional Prototypes) that aids experts in cluster analysis with human-readable rule-based explanations. The developed state-of-the-art explanation mechanism is based on cluster prototypes represented by multidimensional bounding boxes. This allows the representation of arbitrarily shaped clusters and combines the strengths of local explanations with the generality of global ones. We demonstrate and evaluate the use of our approach in a real-life industrial case study from the domain of steel manufacturing, as well as on benchmark datasets. The explanations generated with ClAMP were more precise than either centroid-based or global ones.

The associate editor coordinating the review of this manuscript and approving it for publication was Alberto Cano.

or confronted with domain knowledge and experts' experience. To deal with this, Explainable AI (XAI) methods are being developed to bring transparency to the decision-making process of AI-based systems [1]. This trend is especially visible in the area of Industry 4.0, where a large amount of data is gathered directly from hardware and is used to discover patterns or anomalies in machinery operation, as well as to provide decision support based on the results. A human operator is usually involved in the analysis and verification of the decisions of the system, because the control of the critical system components cannot be left solely to the AI system. On the other hand, this requires the model to be understandable by a domain expert, as depicted in Figure 1.

industrial case of the hot-rolling process from the steel industry. To achieve this, we attempted to represent clusters with multidimensional prototypes and utilise these prototypes in the explanation process. The developed methodology can be divided into the following stages:

• Execute clustering with an arbitrarily selected method;
• Reformulate the problem as a classification task;
• Generate cluster prototypes in the form of multidimensional bounding boxes and obtain rule-based explanations for them;

• Evaluate the generated rules with the use of the HeaRTDroid inference engine [3] and experts' knowledge.

This work is carried out within the CHIST-ERA Pacmel project. The project aims to develop novel methods of process mining, knowledge modelling, and intelligent sensor data analysis in Industry 4.0. In the area of rules and inference engines, we build on our previous works, including the XTT2 (formalised rule representation) rule-based knowledge representation and the HeaRTDroid inference engine [3], which were developed by us using the Semantic Knowledge Engineering methodology [4].

The remainder of the paper is organised as follows: in Section II, we describe the works concerning explainable methods. This is the foundation for our motivation and original contribution described in Section III. In Section IV, we concentrate on describing the clustering and classification methods and present a novel approach to building prototypes for clusters. This section also includes the description of a method for obtaining rule-based explanations for discovered prototypes. In Section V, we present a functional evaluation of ClAMP in comparison to centroid-based and global explanations used in state-of-the-art solutions. In Section VI, we perform a human-grounded evaluation on synthetic datasets with 24 participants involved in the process. Finally, in Section VII, we move on to the case study

The authors in [7] develop the Single Feature Introduction (FCPS). In both cases, this method is able to correctly uncover patterns.

In [11], the adopted method concentrates on the centres of the clusters. Discovered Cluster-based sentence utility (CBSU, or utility) refers to the degree of relevance (on a scale from 0 to 10) of a particular sentence to the general topic of the entire cluster.
However, such methods are very sensitive to the shape of the clusters and can be executed only in specific cases.

Many explainability approaches consider the use of tree-based clustering models. According to [12], the most popular method is cluster representation with the use of centroids. However, in the case of non-compact or non-isotropic clusters, such a method cannot be executed successfully. Another common approach is visualisation with the use of principal component analysis; in this case, however, we lose the relationship between the clusters and the original variables. In [12], the authors propose an unsupervised learning algorithm that solves the task through an optimisation lens while providing the user with more accurate and interpretable results based on the feature vectors. They use the Silhouette metric and the Dunn index as the objective function. Tests were executed using datasets from FCPS and real-world examples.

In [5], the authors use methods of supervised machine learning for cluster interpretation by recasting the problem as a classification case. In particular, they analyse which features are necessary to assign instances to the correct cluster. This allows recognising the characteristics relevant to specific cluster structures.

The method presented in [6] aims to explain the outcome of unsupervised algorithms. Generally, the framework relies on the expert's knowledge to, i.a., extract the correct features (feature selection). When the data is embedded, EXPLAIN-IT uses unsupervised learning techniques to explore it. In particular, EXPLAIN-IT uses a clustering technique that plays the role of a meta-learning approach, which reduces the complexity of the analysis using the core idea of clustering methods: aggregating similar instances.

In [13], the authors outline that there are no effective methods to apply to security tasks. In their paper, they propose a dedicated method that generates a small set of interpretable features to explain how the input sample is classified.

In [19], the authors present a novel model-agnostic algorithm called Anchor. Based on a given instance, the Anchor algorithm generates a rule that sufficiently determines the prediction locally. It should be emphasised that changes to other feature values of the instance do not essentially affect the prediction value. For each instance, the Anchor algorithm starts with an empty rule; subsequently, in an iterative fashion, new rules are generated, and the previous one is replaced if its precision is lower.

Most of the methods mentioned in the previous section are focused on a specific task and tuned to work with particular clustering algorithms, or with a particular audience. On the other hand, general frameworks such as [5], [6], [7], and [10] focus mostly on global explanations, which limits the details presented to the user and reduces the capabilities of in-depth cluster analysis. In Table 1, we present a summary of related works in the area of explainable clustering. One can observe that there is no solution providing a hybrid explanation mechanism that will: 1) allow for a balance between the expressiveness and granularity of the generated results; 2) allow the use of an arbitrarily selected clustering algorithm; 3) allow the use of an arbitrarily selected classification method to discover patterns between clusters; and 4) provide explanations in an executable format that allows for easier, automated integration with other system components.

In our approach, we aim to provide a method that addresses all four of the above issues. The starting point of this work was the preliminary results introduced at the IEEE DSAA 2021 Conference [20]. Here, we present a fully developed approach, enclosed within a methodological framework for Cluster Analysis with Multidimensional Prototypes (ClAMP) and evaluated on a real-life industrial case and benchmark datasets. The most important aspects of our original contribution include the following:

• We expanded the possibilities of cluster representation. We added another method for discovering cluster prototypes. We were interested in how ran-

In the following section, more details on the ClAMP methodology will be provided.

VOLUME 10, 2022
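The overall flow of the methodology (clustering, reformulation as classification, prototype generation, rule extraction) can be sketched end-to-end as follows. This is a minimal illustration on synthetic data: the axis-aligned min/max box is a naive stand-in for the ClAMP prototype-selection methods, and scikit-learn's gradient boosting stands in for XGBoost.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import Birch
from sklearn.ensemble import GradientBoostingClassifier

# Stage 1: clustering with an arbitrarily selected method (here BIRCH).
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = Birch(n_clusters=3).fit_predict(X)

# Stage 2: reformulate as classification: a model learns to reproduce
# the cluster labels (a stand-in for the XGBoost classifier).
clf = GradientBoostingClassifier(random_state=0).fit(X, labels)

# Stage 3: a naive multidimensional prototype: the axis-aligned bounding
# box of each cluster (ClAMP selects boundary points more carefully).
boxes = {c: (X[labels == c].min(axis=0), X[labels == c].max(axis=0))
         for c in np.unique(labels)}

# Stage 4: each box translates directly into a human-readable rule.
for c, (lo, hi) in boxes.items():
    conds = [f"{lo[i]:.2f} <= x{i} <= {hi[i]:.2f}" for i in range(X.shape[1])]
    print(f"cluster {c}: IF " + " AND ".join(conds))
```

The printed conjunctions illustrate the kind of rule-based output the methodology aims at; the real pipeline optimises the boundary-point selection per cluster and formalises the rules for execution.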

The main goal of our work on ClAMP was to provide a method for cluster analysis that will be agnostic with respect

In the following sections, these main phases will be described in detail.

We tested the following clustering methods to assign labels to the analysed datasets: Gaussian mixture, BIRCH (balanced iterative reducing and clustering using hierarchies), and the deep temporal clustering algorithm. The first two methods are implemented in scikit-learn [21]. The third method is presented in [22]. The algorithm utilises an autoencoder for temporal dimensionality reduction and a novel temporal clustering layer for cluster assignment. Then, the clustering and dimensionality reduction objectives are optimised. To detect the optimal number of clusters, we used the silhouette score; however, the choice of the metric used for selecting the number of clusters is not limited.
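The silhouette-based selection of the number of clusters can be sketched as follows (BIRCH on synthetic data; any of the listed algorithms and any internal validity metric could be substituted):

```python
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with a known number of groups, for illustration only.
X, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.8, random_state=42)

# Score every candidate number of clusters and keep the best one.
scores = {}
for k in range(2, 8):
    labels = Birch(n_clusters=k).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k)
```

On well-separated data such as this, the silhouette score typically recovers the generated number of blobs.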

It is worth noting that this stage is independent of the rest of the methodology. In fact, one can also apply our approach to a dataset which originally contained labels, or where labels were obtained using expert knowledge instead of a clustering algorithm. This could be particularly useful in cases where the cluster analysis is performed mainly for conformance checking with existing domain knowledge [23].

To reformulate the clustering problem as a classification task, it is necessary to find a classifier that reproduces the labels obtained during the clustering stage in the best possible way. In our work, we chose the XGBoost (gradient boosting framework) classifier [24] as the classification algorithm: an optimised, distributed, open-source gradient boosting package designed to be highly efficient, flexible, and portable. It implements machine learning algorithms under the gradient boosting framework and provides parallel tree boosting that solves many data science problems in a fast and accurate way. Results demonstrated in [24] show that the XGBoost classifier can be used for a wide range of problems. Such classifiers have great potential and allow obtaining good results; however, they have many hyperparameters that directly affect these results. To account for this, hyperparameter tuning should be performed [25]. It is possible to do this manually, but in that case, the user cannot be sure that the best parameter settings have been determined. To do it automatically, a simple grid search algorithm can be applied, which checks each combination of parameter values defined in their domains (ranges) determined by the user.
In our case, we applied RandomizedSearchCV (randomised search with cross-validation), available in scikit-learn [21], because this optimiser allows obtaining satisfying results by trying only a fixed number of parameter settings. Random search is often more practical than grid search [25], as it does not test all parameter combinations but samples the search space at random. For automatic hyperparameter tuning, other optimisation methods can also be applied, e.g., Sequential Model-Based Optimization (SMBO), implemented in the model-based optimisation package mlrMBO [25], [26].
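A minimal sketch of this tuning step, using scikit-learn's RandomizedSearchCV with a gradient boosting classifier standing in for XGBoost; the parameter distributions below are illustrative, not the ones used in the paper:

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Pseudo-labels from the clustering stage would normally play the role of y.
X, y = make_classification(n_samples=400, n_classes=3, n_informative=5,
                           random_state=0)

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_distributions={
        "n_estimators": randint(50, 300),
        "max_depth": randint(2, 6),
        "learning_rate": uniform(0.01, 0.3),
    },
    n_iter=10,           # only a fixed number of settings is tried
    cv=3,
    scoring="f1_macro",  # reproduce the cluster labels as faithfully as possible
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Unlike an exhaustive grid search, the cost here is bounded by `n_iter` regardless of how large the parameter space is.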

FIGURE 2. ClAMP methodology diagram.
To validate the effectiveness of the classification methods built on top of cluster labels, several metrics can be used.

In our case, we used recall, precision, F1-score, and accuracy.

It is worth noting that for clusters of different shapes, different bounding boxes may be suitable for different clusters. Therefore, in ClAMP, we optimise the selection of the method for each of the clusters separately. The selection of ClAMP hyperparameters can be done automatically with any optimisation algorithm and with respect to the target metric we want to optimise (for instance, the accuracy of the explanations obtained). The selection of a metric and optimisation algorithm depends on the task we want to solve and the data we use (e.g., balanced, imbalanced, etc.).

Implementation of the K-D tree algorithm requires tuning some of the hyperparameters, such as leaf size and metric. According to the documentation, the ''leaf size'' parameter does not affect the results of the algorithm, so the default value was used. For the ''metric'' parameter, two possible values were considered in this paper: ''minkowski'' and ''manhattan''. Because the bounding box we are looking for consists of the outermost points, we added one more hyperparameter, which is the percentage of the farthest points from the centre of each cluster. Exemplary points determined by the K-D tree approach are presented in Figure 5. As can be seen, for each of the clusters, the K-D tree algorithm found the outermost points (boundaries of each cluster), which was one of the goals of our developed methodology.

The isolation forest method is one of the ways to execute outlier detection in high-dimensional datasets. The principle of operation is to ''isolate'' observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of the selected feature [21], [28].
In the algorithm, the recursive partitioning can be represented by a tree structure, while the number of splittings required to isolate each sample is equivalent to the path length from the root node to the terminating node. The length of this path, averaged over a forest of random trees, is the measure of normality and our decision function. Thanks to random partitioning, noticeably shorter paths are produced for anomalies. Hence, when the random trees collectively produce shorter paths for a sample, it is more probable that the sample is an anomaly [29], or, in our case, a bounding box point.

In the methodology developed in this paper, the Isolation Forest algorithm is applied to detect the outer points belonging to the specified cluster, which can then be used to generate rules.
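The two boundary-point selectors described above can be sketched as follows; the fraction of farthest points kept per cluster corresponds to the extra hyperparameter mentioned earlier, and the concrete values here are illustrative:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import KDTree

X, labels = make_blobs(n_samples=300, centers=2, random_state=1)
cluster = X[labels == 0]

# K-D tree variant: rank all points by distance from the cluster centre
# and keep the farthest fraction as bounding box candidates.
centre = cluster.mean(axis=0, keepdims=True)
tree = KDTree(cluster, metric="minkowski")   # leaf_size left at its default
dist, idx = tree.query(centre, k=len(cluster))
frac = 0.1                                   # percentage of farthest points
outer_kd = cluster[idx[0][-int(frac * len(cluster)):]]

# Isolation Forest variant: points flagged as anomalies (shortest
# isolation paths) are treated as outer points of the cluster.
iso = IsolationForest(contamination=frac, random_state=1).fit(cluster)
outer_if = cluster[iso.predict(cluster) == -1]

print(len(outer_kd), len(outer_if))
```

Both variants return a small set of peripheral points per cluster, from which an enclosing bounding box can then be formed.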

The set of hyperparameters allows adjusting the algorithm to

able to construct an explanation whose coverage is adapted to the model's behaviour, and clearly determine their boundary [30].

Exemplary rules generated by the Anchor explainer are presented in Table 2. The Cluster column gives the number of the cluster determined by the rule. The Coverage and Precision columns describe, respectively, the ratio of instances in the whole dataset for which the rule holds, and the rule's precision on this subset of instances.
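The Coverage and Precision values in Table 2 follow directly from their definitions; the rule and cluster below are hypothetical examples, not taken from the paper:

```python
import numpy as np
from sklearn.datasets import make_blobs

X, labels = make_blobs(n_samples=500, centers=3, random_state=7)

def rule(x):
    # Hypothetical anchor-style rule: a conjunction of feature bounds.
    return (x[0] > -2.0) and (x[1] <= 4.0)

target_cluster = 0
holds = np.array([rule(x) for x in X])

# Coverage: share of the whole dataset for which the rule holds.
coverage = holds.mean()
# Precision: share of covered points that belong to the target cluster.
precision = (labels[holds] == target_cluster).mean()
print(round(coverage, 3), round(precision, 3))
```

A good explanatory rule combines high precision with coverage large enough to characterise the cluster rather than a single instance.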

The rules obtained with the Anchor algorithm can be directly analysed by the expert, but they can also be formalised and executed. This allows for automatic evaluation of the rules obtained within the ClAMP methodology, as well as easier integration with other system components. For the purpose of representation and execution of the rules, we use the HMR+ rule language and the HeaRTDroid inference engine, described in the following paragraphs.

HeaRTDroid is a rule-based engine that uses the rule-based language HMR+, which allows reasoning with and handling of uncertain and incomplete knowledge. The HMR+ language used by HeaRTDroid also allows for modelling uncertainty with certainty factor algebra [3].

In our methodology, HeaRTDroid is used for executing a rule-based model consisting of rules, precision and coverage parameters, and the cluster numbers determined by the rules, as shown in Table 2. The key idea of using HeaRTDroid is to evaluate the effectiveness of the rule-based model provided by the Anchor algorithm.
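A much-simplified sketch of this evaluation step: HeaRTDroid itself executes HMR+ models, whereas here each rule is a plain predicate, and overlaps between rules are resolved by preferring the most precise firing rule, which is one plausible strategy rather than the engine's exact semantics.

```python
# Each rule: (cluster_number, precision, predicate over a data point).
rules = [
    (0, 0.95, lambda x: x[0] <= 0.0),
    (1, 0.90, lambda x: x[0] > 0.0 and x[1] > 1.0),
    (1, 0.70, lambda x: x[1] > 0.5),
]

def predict_cluster(x, rules, default=-1):
    """Fire all matching rules and return the cluster of the most precise one."""
    fired = [(prec, cluster) for cluster, prec, pred in rules if pred(x)]
    if not fired:
        return default  # no rule covers the point
    return max(fired)[1]

points = [(-1.0, 0.0), (2.0, 3.0), (0.5, 0.6)]
print([predict_cluster(p, rules) for p in points])  # -> [0, 1, 1]
```

Comparing such predicted cluster numbers against the original clustering labels yields the rule-quality metrics used later in the paper.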

More specifically, the rule-based model with the above-mentioned parameters, together with data points without any labels, is treated as the input to the HeaRTDroid inference engine. Then, HeaRTDroid is executed, and the main task of this stage is to predict the cluster number based on the given rules, their precision and coverage parameters, and the point under test. This action is executed for each point in the tested dataset. As a result, a cluster number is predicted for each tested point.

with experts is also possible in the ClAMP methodology. We provide the rules, which consist of feature names, values, and inequality signs (a human-readable form), to the experts.

The task is to check the rules generated by the Anchor algorithm and evaluate them. An important issue for our methodology is to obtain rules which would be understandable and useful for the experts. This means that, after looking at them, the expert should be able to clearly assign which rules concern which cluster and determine how well these rules describe the cluster. Additionally, the expert should be able to determine whether these rules bring information

available datasets. The goal of this section was to confront the novel ClAMP methods of generating explanations with state-of-the-art approaches that are based on cluster centroids or global explanations. This forms a reproducible set of tests, focused on functional evaluation (no human factor involved), that can be used to achieve an unbiased comparison of our method with other approaches.2 The factor that we took into consideration in this type of evaluation was the quality of the explanations in terms of accuracy. We wanted to show that ClAMP provides more accurate explanations at a similar level of complexity (e.g., length of the rule, number of rules) compared to centroid-based and global explanations.

All of the phases of ClAMP (see Figure 2) were fully automated and optimised with the GridSearch algorithm. The generated rules were tested against selected quality metrics (i.e., accuracy, F1, precision, and recall) in a 10-fold cross-validation approach. As a result, we obtained 10 measurements for each combination of dataset and bounding box selection method. The summarised results for the F1 metric are presented in Table 3.

Our goal was to show that the ClAMP selection methods are better than centroid-based and global ones. Therefore, we performed a Friedman test followed by a Nemenyi pairwise post-hoc test for multiple comparisons of mean rank sums.

From the Friedman test, we obtained a statistic equal to 28.0, with a p-value equal to 0.000008. With 6 algorithms and 14 datasets, we have 5 and 65 degrees of freedom, respectively, which allows us to determine that the critical value of F(5, 65) for α = 0.05 is 2.35. This allows us to reject the null hypothesis.
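The degrees of freedom and the critical value quoted above can be reproduced with SciPy; with k = 6 algorithms and N = 14 datasets, they follow as k − 1 = 5 and (k − 1)(N − 1) = 65 (small rounding differences with the tabulated value are possible):

```python
from scipy.stats import f

k, N = 6, 14                       # algorithms, datasets
df1, df2 = k - 1, (k - 1) * (N - 1)

# Critical value of the F distribution at alpha = 0.05.
crit = f.ppf(1 - 0.05, df1, df2)
print(df1, df2, round(crit, 2))
```

Since the reported test statistic of 28.0 far exceeds this threshold, the null hypothesis of equal mean ranks is rejected.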

After this, we performed a Nemenyi test to observe how the algorithms differ, and between which algorithms the difference is statistically significant. The results from the post-hoc Nemenyi test are presented in Table 4 and also visualised in Figure 7.
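The critical distance used in such comparisons follows the standard Nemenyi formula CD = q_α · sqrt(k(k + 1) / (6N)); for k = 6 algorithms, N = 14 datasets, and the tabulated constant q_0.05 ≈ 2.850 for k = 6:

```python
import math

k, N = 6, 14      # algorithms, datasets
q_alpha = 2.850   # Studentized-range-based constant for k = 6, alpha = 0.05

# Nemenyi critical distance between mean ranks.
cd = q_alpha * math.sqrt(k * (k + 1) / (6 * N))
print(round(cd, 3))  # -> 2.015
```

Two methods whose mean ranks differ by more than this distance are considered significantly different.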

It can be observed that the critical distance is 2.015, and the results show that ClAMP is significantly better than the other methods in achieving good quality explanations. It is worth noting that each of the bounding box methods taken separately (i.e., Isolation Forest, random selection, K-D tree query) might not be significantly better than the others; this depends on the cluster shapes and, thus, the dataset used for clustering. It also depends on the clustering algorithm used (e.g., K-means produces similarly shaped clusters, while DBSCAN might produce arbitrarily shaped groups). Therefore, using ClAMP in order to optimise the selection of the bounding box is a reasonable approach.

2 The datasets along with the source code of the benchmark were made publicly available at https://github.com/sbobek/clamp

TABLE 3. Comparison of F1 performance. The column denoted as ClAMP represents the combined approach that integrates all of the bounding box methods, including Isolation Forest, K-D tree query, and random selection, optimised against selected quality measures. The values after ± denote the standard deviation in 10-fold cross-validation.

In the following sections, we evaluate the explanations

Figure 10

The last step was to provide the dataset to the participants, who were asked to use the ClAMP methodology to generate explanations for the discovered clusters by tuning the hyperparameters of ClAMP and finally to evaluate their quality.

The dataset was randomly chosen for each participant. After the programming task was completed, the participants were required to fill in a survey containing evaluation questions.3

The next section presents the results obtained from the evaluation on synthetic datasets.

In the following section, we present the results obtained from the 25 participants who took part in the study. Each participant was asked to evaluate the clustering results and explanations according to the 4 criteria listed below. Additionally, we asked the participants several questions concerning each of the criteria used to obtain the evaluation.

cluster results. To evaluate this, we asked the participants two questions.

Question asked to the participants: Are the rules adequate to explain a given cluster, or rather individual instances in the cluster?

Answer: The answers are presented in Figure 11, where 1 corresponds to the cluster and 5 to the instance. The majority of the participants decided that the explanations better explain the whole cluster rather than a single instance. The median value of the responses is equal to 2. Therefore, the rules ensure a good level of generality.

Question asked to the participants: With how many rules (at maximum) can each cluster be described so that the rules are still understandable?

Answer: The answers are presented in Figure 12. Most participants indicated that a rather low number of rules is better for explaining the cluster. This is the premise for the conclusion that the rules generated by our method are expressive enough to give a sufficient amount of information (details) to the participants. Two of the participants stand out significantly from the rest of the results.

FIGURE 13. What would be more time-consuming to distinguish and describe the clusters: using rules or using available cluster labels?

FIGURE 14. How are the rules understandable to you, i.e., do they provide information on the basis of which you are able to draw dependencies between them?
Answer: The answers are presented in Figure 13, where 1 corresponds to the rules and 5 to the labels. The median value of the results is equal to 3. This means that the participants state that in both cases the time could be comparable. However, this could be caused by the fact that the participants considered this question on relatively easy-to-understand datasets.

The goal of this criterion was to determine whether the method can be evaluated by participants who possess only domain knowledge and do not have any experience connected with data science.

Question asked to the participants: How understandable are the rules to you, i.e., do they provide information on the basis of which you are able to draw dependencies between them?

Answer: The answers are presented in Figure 14, where 1 corresponds to non-understandable and 5 to understandable. Most of the participants agreed that the generated rules were understandable to them.

The goal of this criterion was to evaluate the overall usefulness of the rules. In the case of very similar rules, the challenge is to separate the rules among clusters, and thus their analysis is complicated. If the rules differ significantly, it is much easier to assign the rules to a specific cluster and determine the differences between the clusters.

Question asked to the participants: How much do the rules help to distinguish the clusters and understand how they differ?

Answer: The answers are presented in Figure 15, where 1 corresponds to not helpful at all and 5 to very helpful. All

Answer: The answers are presented in Figure 17.

To finally evaluate our methodology based on the artificially generated datasets, we decided to ask another question. In the script, there was a possibility to generate rules based not on the bounding box prototypes but on the centroid point of each cluster. As a result, the participants were able to compare the results obtained for each of the methods and answer the following question.

Question asked to the participants: In comparison to the benchmark (centroid-based), are the ClAMP results better?

Answer: The answers are presented in Figure 19. Almost 80% of the respondents answered that the ClAMP methodology allows obtaining rules that better describe the clusters and help to understand them.

The overall evaluation results suggest that our methodology is useful for participants in cluster analysis. Taking the obtained results into consideration, we are able to state that the developed methodology delivers satisfactory results when applied to artificially generated datasets.

Participants agreed that the ClAMP methodology is better at describing whole clusters than individual instances, which was one of our goals (see Figure 11). Additionally, Figure 12 depicts the answers to the question of how many rules the participants considered satisfactory to maintain the clarity of the rules; it shows the maximum number of rules that is satisfactory to understand the clusters well. Only three of the participants answered that this number could be greater than 10. This answer is aligned with our observation that interpretable models are not always explainable due to their complexity.

a more time-consuming cluster description, using the rules or using instances, see Figure 13. We assume that the use of rules should be much faster than the use of each instance.

almost 900 °C. Finally, each prepared product is coiled and transferred to storage [32]. Figure 20 shows a schematic diagram of the hot rolling process.

For the analysis, we took into consideration 10 000 different slabs with four parameters for each of them: width, profile, tempexit, and tempcoil, with the average and standard deviation calculated for each of these parameters. These parameters were chosen by the experts as key parameters for final product quality. Our assumption was to treat the case as an unsupervised machine learning problem, because such an approach gives opportunities to discover data patterns that would be imperceptible to the experts. The industrial problem considered in this paper is directly connected with the hot-rolling process described in this section. Based on the obtained parameters, we performed clustering. As a result, all considered slabs were divided into three groups, which allows us to suppose that processes occurring in the production phase affect the final quality of the product. In cooperation with the experts, we decided to use the ClAMP methodology to uncover the differences between these three groups. Such classification, together with a full understanding of the dependencies between the groups, may result in better process management.

As the analysed problem is treated as an unsupervised problem, we decided to present the results in two stages: clustering and classification; and rule creation and evaluation.

The initial step in the methodology includes data clustering to obtain good quality clusters. We tested three different methods, as mentioned in Section IV-A: a temporal algorithm (DTC, deep temporal clustering), a distribution-based one (Gaussian mixture), and a hierarchical one (BIRCH). In the case of the DTC and BIRCH algorithms, we used the silhouette score; in the Gaussian mixture method, we determined the number of clusters based on the BIC (Bayesian Information Criterion) and AIC (Akaike Information Criterion) metrics [21], [33]. The comparison of the obtained results is presented in Table 5. It is visible that the BIRCH algorithm performed best of all, hence it was chosen as the method for further analysis.
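The BIC/AIC-based selection of the number of Gaussian mixture components can be sketched as follows; synthetic blobs stand in here for the slab parameters, which are not reproduced from the paper:

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=600, centers=3, cluster_std=1.0, random_state=3)

# Lower BIC/AIC indicates a better trade-off between fit and model complexity.
bic, aic = {}, {}
for k in range(1, 7):
    gm = GaussianMixture(n_components=k, random_state=3).fit(X)
    bic[k] = gm.bic(X)
    aic[k] = gm.aic(X)

best_bic = min(bic, key=bic.get)
best_aic = min(aic, key=aic.get)
print(best_bic, best_aic)
```

On well-separated data, both criteria typically agree on the generated number of components; AIC tends to penalise complexity less and may occasionally prefer a larger model.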

After the cluster labels were obtained, we used the XGBoost classifier with hyperparameter optimisation to build a model that would be able to distinguish the clusters as accurately as possible. For BIRCH clustering, we obtained

In the final step of this stage, we obtained a dataset with labels split into 3 classes with the distribution 0.43, 0.43, and 0.14 for classes 0, 1, and 2, respectively.
Therefore, the generated rules, their precision and coverage parameters, and a testing subset were treated as the input to HeaRTDroid, based on which HeaRTDroid predicted the cluster label for each testing point. As a result, we obtained an array of labels predicted based on the rules generated for the considered bounding boxes. These labels were then compared with the labels generated by the clustering algorithm for the testing subset.

To check the effectiveness of the generated rules, we used the four metrics described in Section IV-D3. The final results of the evaluation, obtained for each of the considered methods used to define the bounding box, are presented in Table 6.

As we decided to treat the F1-score as the target metric, the best score, equal to 0.98, was obtained by the ClAMP method, which combines all of the bounding box description methods, as presented in Table 6.

The quality of the rules in terms of classification metrics can be adjusted by changing the number of points used to form a bounding box. In general, the more points we use, the more precise the classification we obtain. As a trade-off, we lose interpretability due to the increased number of rules to be analysed by the expert. Figure 21 presents charts that show how the classification scores change with the number of considered bounding points, defined as a percentage. Additionally, we present such dependencies for two different metrics applied in the K-D tree description method. The number of points for which we obtained the best F1-score for each of the considered methods is marked by a red dashed line.

In Figure 21, the largest changes in values occur for the random selection method and increase monotonically, which is understandable, as it covers more and more points from the dataset. This increases accuracy but makes the explanation model more complex and thus less useful in practical applications. For the centroid-based method, no change is visible, as a single point is always used as the ''bounding box''. Comparing the charts for the random selection and isolation forest methods, we can see that the KD-tree method needs fewer points than the others to obtain comparable results. Such a presentation, together with knowledge of how the presented methods work, may be very informative for the experts.

With the 26 obtained rules, we performed an evaluation with three domain experts recruited from ArcelorMittal, who were asked to use ClAMP to obtain explanations for clusters and later answer the same set of questions as presented in Section VI-B2. In Table 7, we present the synthesised results obtained both from the domain experts in the industrial case and from the participants involved in the evaluation on the artificial datasets.

From the analysis of the results, one can see that the answers differ slightly depending on the dataset. This is especially […] In Section VI-B1, two experts agreed with the statement that the rules presented in our approach are better suited for explaining clusters rather than single instances, which was our intention. One of […] can be drawn from fewer than 10 rules.

In addition, the expert pointed out that a set of rules, each describing the limits of a single parameter at once, is more intuitive than interpreting one rule describing the relation between 3 or more features. Rules built on the basis of 2 parameters seem to have the right balance between the amount of information and the ease of its evaluation. In comparison to the analysis without explanations (Section VI-B2), two experts agreed that our method can strongly decrease the time needed to distinguish and describe the clusters. One of the experts pointed out that the method allows adjusting the complexity of the condition set to the time constraints of the user for whom it is prepared. For a fast evaluation, the generated conditions could be visualised and put in the context of product measurements. For the purposes of an advanced study, the amount of information is more important than the time of analysis. In that case, the computed rules could serve as an intermediate dataset for further analysis.

Domain knowledge is as important as a data science background in analysing the results of the explanations (Section VI-B3). The data, which were treated as an input to the ClAMP methodology in the presented use case, were delivered to the data scientist and based on statistical properties such as standard deviations, variance, etc. We also consulted experts without data science experience about the results. They pointed out that they would prefer rules generated based on real production parameters, not on the conditions […] by an inexperienced user such as management […]

[…] three different clustering methods. As a classification algorithm, we used XGBoost with hyperparameter optimisation. In the second stage, we implemented three methods that can be used to determine the prototypes (bounding boxes) that are then treated as an input to the explainer algorithm. Thanks to the application of these methods, the clusters are described only by the most representative points, which allows avoiding the generation of unnecessary rules that could introduce noise into the explainability mechanism. Such an approach limits the computational time needed to generate explanations and increases the transparency of the explanations. Hence, the proposed approach increases the effectiveness and efficiency of rule generation. The generated bounding boxes are treated as an input to the Anchor explainers to generate rules for each cluster. The Anchor explainer generates precision and coverage parameters as well. Thanks to these, it is possible to determine which rules describe the cluster better. These parameters are also useful for checking the effectiveness of the created rules.
To do that, we used the HeaRTDroid rule-based inference engine, which allows predicting labels based on the rules generated by the Anchor explainer and the parameters it returns. It also allows for the integration of the knowledge discovered using the XAI method with other system components. As a result, we implemented an approach that delivers human-readable rules to the experts, taking into consideration different clustering methods, hyperparameter optimisation, and a novel approach to generating bounding boxes for evaluation.
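How precision and coverage attach to a bounding-box rule can be shown in a few lines, using Anchor-style semantics: precision is the fraction of covered points that belong to the cluster, and coverage is the fraction of all points the rule covers. The data and resulting thresholds below are invented; this is a sketch of the scoring, not the paper's exact pipeline.

```python
# Labelled points: ((x, y), cluster); illustrative values only.
points = [((0.1, 0.2), 0), ((0.3, 0.1), 0), ((0.5, 0.5), 0),
          ((0.9, 0.8), 1), ((0.8, 0.9), 1), ((0.85, 0.4), 1)]

def bounding_box(cluster):
    pts = [p for p, c in points if c == cluster]
    xs, ys = zip(*pts)
    return (min(xs), max(xs)), (min(ys), max(ys))

def rule_stats(cluster):
    """Render the cluster's box as a rule and score it Anchor-style."""
    (x0, x1), (y0, y1) = bounding_box(cluster)
    covered = [(p, c) for p, c in points
               if x0 <= p[0] <= x1 and y0 <= p[1] <= y1]
    precision = sum(c == cluster for _, c in covered) / len(covered)
    coverage = len(covered) / len(points)
    rule = f"IF {x0} <= x <= {x1} AND {y0} <= y <= {y1} THEN cluster {cluster}"
    return rule, precision, coverage

stats = {c: rule_stats(c) for c in (0, 1)}
for rule, prec, cov in stats.values():
    print(f"{rule}  (precision={prec:.2f}, coverage={cov:.2f})")
```

Rules with higher precision describe their cluster more faithfully; coverage indicates how general the rule is, which is exactly the comparison the parameters enable above.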

In comparison to the methods presented in Section II, we noticed two main differences. Firstly, the developed methodology allows obtaining a cluster representation in the form of human-readable rules. It is worth emphasising that we do not concentrate on explaining particular instances but rather the whole group. This gives the opportunity to deliver to the experts information about the considered groups and to understand the division of the data into clusters. Secondly, our methodology can verify the obtained rules with the use of HeaRTDroid, which allows predicting labels based on the instances and the rules obtained from the training set.

We demonstrated our approach using two cases. The first concerns publicly available, artificially generated datasets that can be considered benchmark cases. The second concentrates on a real-life use case scenario with confidential data shared for the purpose of the PACMEL project by the company ArcelorMittal. Taking into account the assessment of the rules by the experts, the proposed methodology proved to be useful. According to the completed questionnaire, the experts pointed out that the rules help them describe the clusters, and such understanding is less time-consuming than in the case of the labels alone. Although the participants noticed that descriptions of clusters usually contain overlapping rules, they were able to identify the redundant parts and correctly interpret the explanations. The bottom line is that the generated rules are able to provide useful information about clustering.

A limitation of the methodology pointed out by the experts is the fact that some of the generated rules overlap. In future work, we are planning to find a solution to that limitation, based on our prior works in this area [35].
We also plan to adjust the bounding […] discovering cluster prototypes, not for the whole data but for each cluster separately. Taking into account the approach to Knowledge Augmented Clustering (KnAC) presented in [23], which is based on the clusters' centroids, ClAMP will be considered as an extension of KnAC.

MICHAL KUK received the M.Sc. degree from the Faculty of Drilling, Oil and Gas, AGH University of Science and Technology, where he is currently pursuing the Ph.D. degree with the Faculty of Drilling, Oil and Gas, with a specialization in mining and geology. In 2017, he completed his master's thesis, in which he developed an algorithm optimizing the location of new wells. His scientific research interests include the optimization of oil and gas production. He uses machine learning algorithms to improve production from reservoirs. One of his methods was presented at the SPE student paper contest, where he achieved second place in the Ph.D. division. Since November 2020, he has been a member of the GEIST Team. He was involved in the Process-Aware Analytics Support based on Conceptual Models for Event Logs (PACMEL) project and is currently participating in the XPM project (explainable predictive maintenance).