Towards a Framework for Capturing Interpretability of Hierarchical Fuzzy Systems — A Participatory Design Approach



I. INTRODUCTION
INTERPRETABILITY is related to the capability of expressing something in an understandable way [1]. That is, people may say that something is interpretable if they can easily understand it. One of the strengths of Fuzzy Logic Systems (FLSs) is claimed to be their interpretability [2], particularly in applications such as knowledge extraction and decision support [3], [4]. However, key challenges remain in the design of FLSs, such as the fact that the number of rules required commonly increases exponentially with the number of input variables [5]. This challenge, known as rule explosion and sometimes referred to as the curse of dimensionality, can reduce the transparency and interpretability of FLSs [6].
Hierarchical Fuzzy Systems (HFSs) could be a practical approach to overcome rule explosion arising in conventional FLSs [7], [8]. In HFSs, the original FLSs are decomposed into a series of low-dimensional subsystems (see Section II-B). As a result, the rules in HFSs commonly have antecedents with fewer variables than the rules in 'flat' FLSs with equivalent function, since the number of input variables of each subsystem is lower [9], [10]. HFSs can thereby address rule explosion and thus provide a potentially valuable pathway to interpretability in FLSs [6], [11]- [15], [16]. However, whilst the number of rules can be reduced, it is an open question as to how interpretability is affected when systems are hierarchical, featuring various subsystems, layers and topologies. A wide range of basic interpretability indices have been proposed to measure the interpretability of standard FLSs [17]- [33].
However, the determination of which of these possible interpretability measurements is best used in practice remains an open discussion. The problem is that interpretability is a very difficult concept, because of its subjective nature in the sense that it is challenging to know how people perceive interpretability. Whilst an index can be relatively easily calculated, it is extremely difficult to validate any such index even for FLSs. This makes the creation of a measure for HFSs even more difficult. Perhaps as a consequence of this, to date very little (if any) work has been carried out in exploring how interpretability can be measured in HFSs.
Participatory design is an approach that involves the participation of users in the design development process to help ensure that the result meets their needs and is usable in practice [34]. Participatory design has been used to develop solutions to complex problems, especially when dealing with people, such as in control systems [35], educational [36] and medical [37] fields. It provides a methodology towards making the design process co-operative and efficient. Hence, it may provide a method of assessing the interpretability of HFSs. This paper introduces a framework for an index to measure the interpretability of HFSs. A participatory design approach is then used to guide the development of this framework, building on initial work to measure the interpretability of HFSs previously proposed by the authors [38]. Naturally, a variety of aspects should be considered in capturing the interpretability of HFSs, such as semantic interpretability in the sense of the meaningfulness of the constituent fuzzy sets and intermediate variables. For example, as also discussed by Magdalena in [39], if the hierarchical decomposition in the fuzzy system reflects a well-understood hierarchical decomposition in the real world, then this is conducive to interpretability. However, in this paper, the framework focuses on addressing key challenges arising from the structure of HFSs. Specifically, it incorporates an elementary index for assessing the interpretability of each subsystem, an aggregation strategy for combining the indices of the various subsystems within a single layer, and a layer-weighting strategy that combines layers while capturing the topology of the HFS. An initial demonstration and evaluation using the participatory design approach is presented to compare and configure the framework so as to allow its implementation in practice.
The rest of this paper is organised as follows: Section II discusses relevant background on the interpretability of FLSs, HFSs and the use of user studies. The framework for interpretability of HFSs is discussed in Section III, followed by an outline of how the framework is demonstrated in principle in Section IV. Section V introduces the participatory design process, consisting of two key experiments: (i) comparing H measure with other aggregation strategies in order to capture overall interpretability of HFSs; and (ii) refinement of the H framework around, particularly, the aggregation strategy for combining the sub-system indices within a single layer and the strategy for assigning weights to the layers. Finally, discussions and conclusions are presented in Sections VI and VII respectively.

A. Interpretability of FLSs
In recent years, the interest of researchers in obtaining interpretable FLSs has increased. Substantial research on interpretability measures has proposed a range of alternative interpretability indices for FLSs [17]- [33]. The most common interpretability indices are the Nauck [17] and the Fuzzy index [19].
1) Nauck Index: This is a numerical index introduced by Nauck [17] to measure the interpretability of fuzzy rule-based classification systems. It is computed as the product of three terms (for details of these, see [38]):

$$\text{Nauck index} = comp \times cov \times part \tag{1}$$

where:
• comp represents the complexity of the FLS, measured as the number of membership functions (MFs) of the output variables divided by the number of input variables in the FLS rules;
• cov is the average normalized coverage degree of the fuzzy partition. It is equal to one for strong fuzzy partitions that satisfy all constraints (coverage, distinguishability, normality, etc.); and
• part stands for the average normalized partition index. The partition index is computed as the inverse of the number of MFs minus one for each input variable.

An FLS model is said to be less interpretable when its Nauck index is closer to 0 and more interpretable when its Nauck index is closer to 1.
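As a minimal sketch of the product described above, the following Python function multiplies the three terms. It assumes that "the number of input variables in the rules" means the total number of premises across all rules; that reading, like the function and parameter names, is an assumption of this sketch and is not taken from [17].

```python
from typing import Sequence


def nauck_index(n_output_mfs: int, n_premises: int,
                mfs_per_input: Sequence[int], cov: float = 1.0) -> float:
    """Product comp * cov * part of the three terms described in the text."""
    # comp: output MFs divided by the input variables used across all rules
    # (assumed here to be the total premise count)
    comp = n_output_mfs / n_premises
    # part: average over the inputs of 1 / (number of MFs - 1)
    part = sum(1.0 / (m - 1) for m in mfs_per_input) / len(mfs_per_input)
    return comp * cov * part


# Hypothetical subsystem: 2 inputs with 2 MFs each, 4 rules (8 premises),
# 2 output MFs, and a strong fuzzy partition (cov = 1).
print(nauck_index(2, 8, [2, 2]))  # 0.25
```

Each input needs at least two MFs, otherwise the partition term divides by zero.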
2) Fuzzy Index: As discussed in [19] and [21], the Fuzzy index, which is inspired by Nauck's index, has been proposed for interpretability assessment, particularly for fuzzy rule-based classification systems. Six variables are taken as the input of an HFS and combined into a single index. The six variables are: (i) the total number of rules (NR); (ii) the total number of premises in all the rules (NP); in a complete rule set, this equals the number of rules multiplied by the number of input variables; (iii) the number of rules which use one input variable ($NR_{i=1}$); (iv) the number of rules which use two input variables ($NR_{i=2}$); (v) the number of rules which use three or more input variables ($NR_{i \geq 3}$); and (vi) the average number of linguistic terms defined for each input variable (terms). The index also depends on the number of classes (NC), also referred to as the number of output terms. It should be noted that although the Fuzzy index is generated using an HFS, it is only designed to measure the interpretability of standard FLSs, and has not previously been applied to HFSs. A Fuzzy index closer to 0 implies that a given FLS is less interpretable, while values closer to 1 imply higher interpretability.

B. Hierarchical Fuzzy Systems
HFSs are characterized by structuring the input variables into a collection of low-dimensional fuzzy logic subsystems, in which the output of each layer is an input to the following layer [7], [8]. Consider a standard FLS consisting of a single layer as shown in Fig. 1. This can be transformed into one of several alternative HFSs, two of which are shown in Figs. 2 and 3.
An FLS that is transformed from a one layer FLS into a multi-layer HFS has a smaller number of rules when considering a fully specified rule base. The most extreme reduction of rules occurs if the structure of the HFS has two input variables for each layer.
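The size of this reduction can be sketched numerically. The snippet assumes m fuzzy sets for every variable (including intermediate ones), a fully specified rule base, and a full decomposition into two-input subsystems, so that a flat FLS needs m^n rules and the HFS needs (n - 1) subsystems of m^2 rules each. The function names are illustrative only.

```python
def rules_flat(n: int, m: int) -> int:
    """Fully specified flat rule base: one rule per combination of input terms."""
    return m ** n


def rules_hierarchical(n: int, m: int) -> int:
    """Fully decomposed HFS: n - 1 two-input subsystems with m * m rules each."""
    return (n - 1) * m ** 2


# Exponential versus linear growth in the number of inputs, with m = 3
for n in range(2, 7):
    print(n, rules_flat(n, 3), rules_hierarchical(n, 3))
```

For n = 4 and m = 3 this gives 81 rules for the flat system and 27 for the hierarchical one.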
In conventional FLSs, the number of rules increases exponentially with the increase in the number of input variables [7], [40]. Suppose there are n input variables and m fuzzy sets for each input variable; then the number of rules ($R_{FLS}$) needed to construct a complete fuzzy system with a fully specified rule base (using the 'AND' logical connective) is:

$$R_{FLS} = m^{n} \tag{2}$$

In contrast, in an HFS which is fully decomposed into subsystems consisting of two inputs and one output, if we define m fuzzy sets for each input variable and each of the intermediate output variables $y_1, \ldots, y_{n-2}$, the total number of rules ($R_{HFS}$) is a linear function of the number of input variables n [41], and can be expressed as:

$$R_{HFS} = (n - 1)\, m^{2} \tag{3}$$

From (2) and (3), it is clear that the total number of rules in the FLSs ($R_{FLS}$) is always higher than or equal to the number in the HFSs ($R_{HFS}$). For example, Fig. 1 and Fig. 3 show an FLS and HFS with n = 4 input variables and, assuming that three fuzzy sets are defined for each input variable (i.e. m = 3), the total number of rules for this FLS is $R_{FLS} = m^{n} = 3^{4} = 81$, whereas for the HFS the total number of rules is $R_{HFS} = (n - 1)\, m^{2} = 3 \times 3^{2} = 27$.

Previous research has shown that HFSs can be used to reduce the number of rules in this manner, and has claimed that this improves interpretability [6], [11]-[14]. However, indices for actually measuring the interpretability of HFSs were not discussed by any of these authors. As mentioned in [38], there are several challenges in creating methodologies for measuring the interpretability of HFSs:

1) Multiple individual subsystems: As mentioned above, HFSs are produced by structuring the input variables in FLSs into multiple subsystems. Each subsystem commonly has a small number of inputs and outputs and a small rule base, and commonly serves a single purpose [42]. The first challenge may be expressed as "How can the interpretability of each subsystem in an HFS be measured using an index?". This challenge is akin to the principal challenge of capturing standard FLS interpretability using an index.
2) Aggregation: The second challenge is the choice of aggregation strategy to combine the indices of the various subsystems in an HFS. Several aggregation strategies may be suitable, such as mean, min, max and Ordered Weighted Average (OWA) [43], [44]. An OWA is calculated by reordering the subsystem indices in descending order before multiplying them by the weights. In [45], Yager introduces the linguistic quantifier to calculate the weights (w), for which he defines certain values of alpha (α) to capture labels such as "At least one" (α = 0.0), "At least a few" (α = 0.1), "A few" (α = 0.5), "Half" (α = 1.0), "Most" (α = 2.0), "Almost all" (α = 10.0) and "All" (α = ∞). This is done by assigning weights according to:

$$w_i = \left(\frac{i}{s}\right)^{\alpha} - \left(\frac{i-1}{s}\right)^{\alpha}, \quad i = 1, \ldots, s \tag{4}$$

where s is the total number of subsystems. The specific attractiveness of the OWA is that it enables dynamic weighting of the subsystems based on their individual interpretability (as established by the traditional FLS indices). For example, choosing α = 0.1 results in a weighting strategy closely resembling the max, in which the most interpretable subsystem in the layer is given the highest weight, the second-most interpretable a substantially lower weight, and so on.
3) Topology and Layering: Based on the same input variables, HFSs with different topologies may be produced, such as the serial and parallel HFSs shown in [9]. A Parallel HFS can have more than one subsystem per layer (e.g., Fig. 2), while Serial HFSs use strictly one subsystem per layer (e.g., Fig. 3). Figs. 2 and 3 show two different topologies of HFSs using the same four input variables. Both topologies use the same number of subsystems, but with different numbers of layers in their structure. Thus, this challenge can be expressed as "How can the interpretability of HFSs with different topologies and number of layers be measured systematically?".
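The linguistic-quantifier OWA discussed under the aggregation challenge can be sketched as follows. The sketch assumes the common power-form quantifier Q(r) = r^α with weights w_i = Q(i/s) - Q((i-1)/s), and sets Q(0) = 0 explicitly so that α = 0 reduces to the max. The function names and the example scores are hypothetical.

```python
import math
from typing import List, Sequence


def quantifier(r: float, alpha: float) -> float:
    """Q(r) = r ** alpha, with Q(0) = 0 so that alpha = 0 behaves like 'max'."""
    return 0.0 if r == 0 else r ** alpha


def owa_weights(s: int, alpha: float) -> List[float]:
    """w_i = Q(i / s) - Q((i - 1) / s) for i = 1..s."""
    return [quantifier(i / s, alpha) - quantifier((i - 1) / s, alpha)
            for i in range(1, s + 1)]


def owa(values: Sequence[float], alpha: float) -> float:
    """Sort the values in descending order, then take the weighted sum."""
    ordered = sorted(values, reverse=True)
    weights = owa_weights(len(ordered), alpha)
    return sum(w * v for w, v in zip(weights, ordered))


scores = [0.40, 0.20, 0.30]  # hypothetical subsystem indices in one layer
for alpha in (0.0, 0.1, 1.0, 2.0, math.inf):
    print(alpha, round(owa(scores, alpha), 4))
```

Under these assumptions α = 0 returns the max, α = 1 the mean and α = ∞ the min, with intermediate values of α moving smoothly between them.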

C. Assessing Interpretability: User studies
A user study allows researchers to identify specific variables that are interesting and observe the impact of varying the values of those variables [46]. Examples of user studies include that of Balazs and Koczy [47] who conducted interviews to ask users to define fuzzy sets, i.e., to get to know what a user meant by 'hot'. Based on the user-defined linguistic terms, fuzzy rules and rule bases can be constructed easily. This was claimed to lead to complexity reduction and improved interpretability.
Mencar and Fanelli [48] conducted a survey with the aim to: (i) give a homogeneous description of all interpretability constraints; (ii) provide a critical review of such constraints; and (iii) identify potentially different meanings of interpretability. Alonso et al. [23] evaluated the most common interpretability indices with a user study (in the form of a web poll) to extract useful information regarding interpretability assessment. The results showed that the Fuzzy index was more easily adapted to the context of each problem as well as the quality criteria of the users. Here, we conduct a user study, inspired by Alonso et al. [23], asking users how the interpretability of given FLSs and HFSs is perceived. However, rather than using the results of the user study to directly evaluate the framework, this paper describes how the data obtained from the user study has been used to guide the development of our framework through a participatory design approach.

III. A FRAMEWORK FOR INTERPRETABILITY OF HIERARCHICAL FUZZY SYSTEMS
A key aspect towards a framework for interpretability of hierarchical fuzzy systems is the need to assess the interpretability of each of its constituent subsystems, present across its layers (as illustrated in Figs. 2 and 3), and then combine these together into a single overall measure of interpretability of the whole system. Clearly, there are many alternative operators that could be selected. For example, it is reasonable to use an aggregation operator that selects something between min and max values [49]. Alternatively, operators which generate results beyond the min and max, such as t-norms or t-conorms, may be applicable. In this paper, our aim is not to identify the best (set of) operator(s); but to put forward one viable strategy towards a flexible framework modelling interpretability in HFSs.

A. The Overall Framework
Following the discussion above, we propose the following high-level structure for the framework. Consider H, the interpretability of an HFS, as follows:

$$H = \sum_{j=1}^{q} l_j \left( \bigoplus_{k=1}^{s_j} E_{jk} \right) \tag{5}$$

where:
• $E_{jk}$ is the underlying (standard) FLS index associated with subsystem k at layer j;
• $\bigoplus$ represents a general aggregation operator;
• $l_j$ is the weight associated with layer j of the HFS (see below);
• $s_j$ is the number of subsystems located in layer j, and s is the total number of subsystems;
• q is the number of layers of the HFS.

Note that $E_{jk}$ could be any index used for measuring the interpretability of a non-hierarchical fuzzy system. In this paper, we neither evaluate nor advocate any specific index. However, to illustrate the framework, we use the Nauck (N) and Fuzzy (F) indices on the basis that they are commonly used. Note that (5) returns the original FLS index when applied to a standard FLS because it has only one subsystem and one layer. Further, a linear weighted aggregation strategy is used in (5) to combine layers, as the simplest strategy to model varying degrees of importance with respect to interpretability across layers. In future, of course, more complex and nonlinear operators could be explored.
Layer-weights, $l_j$, are associated with each subsystem according to its layer, such that the sum of all layer-weights $l_j$ is equal to one regardless of the number of layers q, i.e.:

$$\sum_{j=1}^{q} l_j = 1 \tag{6}$$

Based on the above, an HFS model is less interpretable when H is close to 0 and more interpretable when H is close to 1.

B. Layer-Weighting Strategy
A variety of weighting strategies for the individual layers within HFSs is possible. Here, we briefly introduce a key set of alternatives.
1) Layer Weights Decreasing with Depth: The $l_j$ are arranged in descending order. This is intended to reflect the fact that the structure of most HFSs is formed by having the most influential input variables in the first layer of the hierarchy, the next most important inputs in the second layer, and so on, as for example in [7], [8]. Hence:

$$l_1 > l_2 > \cdots > l_q$$

In order to achieve this and satisfy (6), $l_j$ can be given by:

$$l_j = \frac{q - j + 1}{\sum_{i=1}^{q} i} = \frac{2\,(q - j + 1)}{q\,(q + 1)} \tag{7}$$

2) Increasing with Depth: The same principle as above, but with the layer-weights increasing with layer depth. This reflects the case in which the input variables in the last layer of the hierarchy are the most important, as given by:

$$l_j = \frac{j}{\sum_{i=1}^{q} i} = \frac{2\,j}{q\,(q + 1)} \tag{8}$$

3) Equal Weighting: Assigning an equal weight to all layers, as given by:

$$l_j = \frac{1}{q} \tag{9}$$
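The three layer-weighting strategies can be sketched in a few lines of Python. The sketch assumes rank-proportional weights normalised by 1 + 2 + ... + q, a form chosen because it sums to one and reproduces the layer-weight values quoted in Section IV-A. The function name and the strategy labels are illustrative.

```python
from typing import List


def layer_weights(q: int, strategy: str = "decreasing") -> List[float]:
    """Weights l_1..l_q that sum to one for a q-layer hierarchy."""
    total = q * (q + 1) / 2  # 1 + 2 + ... + q
    if strategy == "decreasing":
        return [(q - j + 1) / total for j in range(1, q + 1)]
    if strategy == "increasing":
        return [j / total for j in range(1, q + 1)]
    if strategy == "equal":
        return [1 / q] * q
    raise ValueError(f"unknown strategy: {strategy}")


for q in (2, 3):
    print(q, [round(w, 3) for w in layer_weights(q)])
# 2 [0.667, 0.333]
# 3 [0.5, 0.333, 0.167]
```

A single-layer system (q = 1) receives a weight of one under all three strategies, so a standard FLS keeps its original index.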

IV. FRAMEWORK DEMONSTRATION IN PRINCIPLE
Following the principle of least commitment, it is intuitive to initially explore the mean as an aggregation operator, both to demonstrate the functionality of the H framework generally and to explore the behaviour of the resulting 'mean-based' H in principle. We initially explored this approach in [38], and summarise the approach and results here.
Considering the mean as the aggregation operator, (5) becomes:

$$H_{mean} = \sum_{j=1}^{q} l_j \left( \frac{1}{s_j} \sum_{k=1}^{s_j} E_{jk} \right) \tag{10}$$

To demonstrate the behaviour of the resulting $H_{mean}$, we consider both the Nauck and the Fuzzy indices (as the underlying indices applied to each subsystem) using the well-known Iris flower classification problem [50]. Note that the Iris classification example is used in this paper because it is simple and well understood. It is used only to illustrate the proposed framework and not to show any benefits of a hierarchical approach over a non-hierarchical one. The Iris dataset has four attributes as input features, namely: sepal length, sepal width, petal length and petal width; and three classes of iris flowers as output, namely: Setosa, Versicolor and Virginica.
We design three individual systems to capture the variety of HFS architectures, namely a (standard) FLS (F), a Parallel HFS (P) and a Serial HFS (S). The three systems were each designed in two configurations, where all variables have either two or three membership functions, termed F-2, P-2, S-2 (collectively referred to as MF-2) and F-3, P-3, S-3 (MF-3), respectively. The various systems are characterised by seven attributes as follows: 1) Model Type: Type of fuzzy model, namely F, P and S as shown in Figs. 1, 2 and 3, respectively. … Table I. The complete rule sets for each of the six variations of the systems are given in Tables S-I to S-VI in the Supplemental material.

A. Methods
In this section, the application of the $H_{mean}$ framework to the six variations of the Iris system described above is shown in detail. Both the Nauck and Fuzzy indices are used within the $H_{mean}$ framework to enable their comparison. The six systems are then also used in the participatory design experiments described later.
The application of the $H_{mean}$ framework to measure the interpretability of each of the six systems is carried out in the following steps: 1) Calculate interpretability for each subsystem: First, the interpretability of each subsystem is calculated using both the Nauck and the Fuzzy indices. For example, the values of the Nauck index for the three subsystems in P-2 (Parallel HFS with 2 MFs) are $N_1 = 0.250$, $N_2 = 0.250$ and $N_3 = 0.375$ (the details of the calculations are shown in Table II).
2) Identify the layer-weights: Next, the values of the layer-weights are computed using (7). For instance, for P-2 and P-3, which consist of two layers, the values of the layer-weights are $l_1 = 0.667$ and $l_2 = 0.333$; whereas for S-2 and S-3, which consist of three layers, the values of the layer-weights are $l_1 = 0.500$, $l_2 = 0.333$ and $l_3 = 0.167$.
3) Calculate the overall interpretability: Then, the overall interpretability can be calculated using the $H_{mean}$ as given in (10). For example, with the first two subsystems of P-2 in layer 1 and the third in layer 2, the interpretability of model P-2 is computed as follows:

$$H_{mean}(\text{P-2}) = l_1 \cdot \frac{N_1 + N_2}{2} + l_2 \cdot N_3 = 0.667 \times \frac{0.250 + 0.250}{2} + 0.333 \times 0.375 = 0.292$$
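The three steps above can be sketched in Python. The function below is a minimal illustration of a layer-weighted aggregation with a pluggable aggregation operator (the mean by default). The name `h_index` is hypothetical. The P-2 layering follows the text; the S-2 layout and its first two subsystem values (same three Nauck values, one subsystem per layer, with the 2×2 → 3 subsystem last) are an assumption of this sketch.

```python
from statistics import mean
from typing import Callable, Sequence


def h_index(layers: Sequence[Sequence[float]], weights: Sequence[float],
            aggregate: Callable[[Sequence[float]], float] = mean) -> float:
    """Weighted sum over layers of the aggregated subsystem indices."""
    if len(layers) != len(weights):
        raise ValueError("one weight per layer is required")
    return sum(w * aggregate(layer) for w, layer in zip(weights, layers))


# Step 1: subsystem indices grouped by layer (Nauck values quoted above)
p2_layers = [[0.250, 0.250], [0.375]]
s2_layers = [[0.250], [0.250], [0.375]]  # assumed layout, see lead-in

# Step 2: decreasing layer-weights quoted above
p2 = h_index(p2_layers, [0.667, 0.333])
s2 = h_index(s2_layers, [0.500, 0.333, 0.167])

# Step 3: overall interpretability
print(round(p2, 3), round(s2, 3))  # 0.292 0.271
```

Passing `min`, `max` or an OWA function as `aggregate` gives the other within-layer strategies without changing the layer-weighting step.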

B. Results
The overall interpretability measurements of the six Iris classification systems calculated using the $H_{mean}$ are shown in Table II. In general, it can be seen that the computed $H_{mean}$ interpretability indices in the various hierarchical models are always larger (i.e. more interpretable) than those of the flat FLSs, regardless of whether the hierarchical topology is parallel or serial, and regardless of the number of membership functions.
As shown in Table II, considering the Nauck index for the two membership function case, the resultant $H_{mean}$ value (i.e. the calculated overall interpretability) is greatest for the parallel HFS model (P-2 = 0.292), followed by the serial HFS model (S-2 = 0.271), and finally the flat FLS (F-2 = 0.047). The same pattern is observed for the Fuzzy index, although the absolute values of interpretability obtained are higher.
Further, as seen in Table II, considering the Nauck index for the three membership function case, the computed interpretabilities are higher for both the hierarchical models (P-3 = S-3 = 0.083) compared to the flat FLS (F-3 = 0.005). However, in this case, the interpretability of the parallel and serial models featuring three membership functions is the same. The same pattern is obtained with the Fuzzy index, albeit with higher absolute values of interpretability.
IEEE TRANSACTION ON FUZZY SYSTEMS, VOL. .., NO. .., 2020
The results generated for the $H_{mean}$ follow intuition in the sense that the HFSs do have better interpretability than the FLS for all systems. Further, the parallel topology, P-2, is seen to have a better interpretability than the serial topology, S-2. This feature is actually due to a combination of three factors: (i) both the Nauck and the Fuzzy indices rate the interpretability of a (sub-)system consisting of two inputs each with two MFs and an output with three MFs (2 × 2 → 3) higher than that of a (sub-)system consisting of two inputs each with two MFs and an output with two MFs (2 × 2 → 2); (ii) the proposed $H_{mean}$ gives higher weight to sub-systems in earlier layers; and (iii) P-2 features the (2 × 2 → 3) subsystem in layer 2, whereas S-2 features it in layer 3. This is not repeated in the case of P-3 and S-3, as in these cases all sub-systems are of the form (3 × 3 → 3) and so all have equal interpretability; hence, the parallel and serial HFS topologies result in the same interpretability.
The Fuzzy index is designed to provide a measurement of interpretability which is closer to the user's point of view than the Nauck or other indices [23]. Given this and our finding that both produce similar results in our $H_{mean}$ experiments, only the Fuzzy index will be used for the remainder of this paper in comparing and refining the H framework.

V. A PARTICIPATORY DESIGN APPROACH
We propose a participatory design approach to compare and derive parameters of H within the framework. As mentioned earlier, participatory design is an approach that involves users in the design development process to ensure the result satisfies their needs [34]. In this section, the participatory design process consists of two main experiments: 1) to assess whether the approach of the H framework, taking into account the topology of connected layers, better matches users' perceptions of interpretability than a non-layered approach; 2) to guide the refinement of the H framework through: (i) the aggregation strategy for combining the sub-system indices within a single layer; and (ii) the strategy for assigning weights to the layers. These two experiments are now described in detail using the examples of the Iris classification application and the Rotary crane system (as used in [51]), respectively.

A. Experiment 1: The H framework itself
First, an experiment was conducted to examine the measurements of the interpretability of HFSs using the H framework and without the framework, from the point of view of users' interpretability within a participatory design approach.

1) Participatory User Study:
The six Iris systems were classified into two groups. The first group, named Set MF-2, consists of three Iris systems ('flat', 'parallel' and 'serial', with two MFs per variable), termed F-2, P-2, and S-2; the second group, named Set MF-3, consists of the three corresponding systems each with three MFs per variable, termed F-3, P-3, and S-3. Each of the Iris systems in Set MF-2 and MF-3 was printed on an A4 card. The topology, membership functions and rule set of each system were summarised on these cards. For example, card F-2 (as shown in Fig. S-1 of the Supplemental material) contained the topology of the FLS as shown in Fig. 1, the two membership functions used in all input variables as shown in Fig. 4, and the complete 16 rules of the FLS.
We carried out this paper-based survey at the Fuzz-IEEE 2017 Conference held in Naples, Italy, during which we asked a sample of participants at the conference to answer a set of questions concerning interpretability. The sample of 25 participants was selected from a range of academics (from doctoral students to full professor), with a range of expertise in fuzzy system design and creation, recruited during the session "Interpretable Fuzzy Systems" and also from other sessions at the conference. The participants were asked to separately rank order the three Iris systems in MF-2 and those in MF-3 based on the perceived interpretability. Users were asked to indicate a rank of 1, 2 or 3, for each of the three systems; with the refinement that they were free to indicate equal ranks for one or more system if they wished -that is, responses such as 1, 1, 1 indicated that all three systems were ranked equally interpretable, or 1, 3, 3 indicated that two of the systems were viewed as being equally less interpretable. Due to this, there may be more or fewer observations of each rank than the number of participants in the study.
The individual responses are shown in Table S-VII (in the Supplemental material), in which the first column indicates the 25 users (referred to as U-1 to U-25), while the second and third columns show the interpretability rankings for Set MF-2 and Set MF-3, respectively. These results are summarised in Tables III and IV, showing the frequency (count and percentage) of each ranking, together with the average rank, of each system. It can be seen that most of the users found the Parallel HFS to be more interpretable than the flat FLS and Serial HFS, in both Set MF-2 and Set MF-3, with 76% of the users selecting P-2 as the most interpretable of the systems with two MFs, and 72% selecting P-3 as the most interpretable of the systems with three MFs. In both cases of two and three MFs, the ranking of the Flat and Serial systems is less clear-cut; in the case of MF-2, it appears that F-2 may be slightly more interpretable than S-2, whereas S-3 may be slightly more interpretable than F-3.
2) $H_{mean}$ vs 'Mean': This experiment explores measuring interpretability using the proposed H framework ($H_{mean}$) compared to not using a framework at all and instead just taking the mean of all the subsystems, regardless of topology (termed simply Mean). Note that our H framework averages the interpretability of the individual subsystems within each layer and then applies the layer-weights to obtain the overall interpretability of an HFS. In contrast, without the framework, the Mean simply treats the interpretability of all subsystems with equal weight regardless of which layer each appears in, the number of layers, etc. That is, the Mean simply averages the interpretability of all subsystems to obtain the overall interpretability of an HFS. Table V shows the interpretability values obtained using the $H_{mean}$ and Mean (i.e. just averaging the subsystems) for the various Iris systems in Set MF-2 and Set MF-3. The resulting rank order of each of the systems is also shown. In general, as can be seen from Table V, the Mean measure produced the same interpretability result for P-2 and S-2; in contrast, the $H_{mean}$ produced different values for P-2 and S-2, indicating that P-2 is more interpretable than S-2, in agreement with the results obtained from users. In the case of Set MF-3, both measures produced the same results for P-3 and S-3. This is because all the subsystems have similar structural characteristics, and hence the same Fuzzy index score (of 0.493, as can be seen in Table II). Thus, any aggregation operators and layer-weighting schemes will also result in the same overall result of 0.493. These results are insufficient to draw strong conclusions from; this is perhaps a reflection of the fact that the Iris system is too simple, with insufficient degrees of freedom to allow for much variation in alternative hierarchical systems. For this reason, we undertook a further set of experiments on a more complex system.
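The difference between the two measures can be sketched as follows. For illustration only, the snippet reuses the Nauck values quoted in Section IV-A rather than the Fuzzy index values of Table V, and it assumes decreasing, rank-proportional layer-weights. The function names and the serial layout are assumptions of this sketch.

```python
from statistics import mean

# Subsystem indices grouped by layer (illustrative values, see lead-in)
parallel = [[0.250, 0.250], [0.375]]   # two layers
serial = [[0.250], [0.250], [0.375]]   # three layers


def plain_mean(layers):
    """Average every subsystem index, ignoring the layer structure."""
    return mean([v for layer in layers for v in layer])


def h_mean(layers):
    """Mean within each layer, then decreasing rank-based layer-weights."""
    q = len(layers)
    total = q * (q + 1) / 2
    # enumerate starts at 0, so layer j (0-based) receives weight (q - j) / total
    return sum((q - j) / total * mean(layer) for j, layer in enumerate(layers))


for name, topology in (("parallel", parallel), ("serial", serial)):
    print(name, round(plain_mean(topology), 3), round(h_mean(topology), 3))
```

With these inputs the plain mean cannot separate the two topologies, because both contain the same three subsystems, whereas the layer-weighted measure scores the parallel layout higher.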

B. Experiment 2: Beyond the Mean
While aggregating the subsystems using the mean and decreasing weight layer-weight can be used as a default strategy, in order to capture the interpretability of HFSs as perceived by actual users, we propose another participatory design approach to derive H parameters within the framework. In this experiment, we use a more complex set of alternative HFSs, based on the Rotary crane system as in [51].
1) Participatory User Study: Twelve Rotary crane systems were constructed, termed A . . . L. Each system has a different configuration such as the number of rules, number of subsystems, number of layers. Illustrations of the topology of each can be seen in Figs. S-2 to S-13.
Similar to Experiment 1 (see Section V-A1), each system was represented on an A4 card. However, this time, we only presented the topology and rule structures. Users were asked to choose which system they favoured in terms of interpretability in a set of pairwise comparisons drawn from the total set of possible pairs. A subset of the pairwise comparisons was selected, as it was deemed impractical to ask users to provide a preference for all 132 pairs, due to the time and effort this would require. The selection of pairs to be used was based on consideration of whether they were felt to be 'not obviously different from each other' and hence interesting and informative to gather preference opinion on. For instance, system A may be paired with all other systems B, . . . , L. However, only the pairs (A,B) and (A,C) were chosen, because they are not obviously different from each other in terms of their structure and number of rules. For instance, for PW-1, the participants were asked to choose between systems A and B, based on their perceived interpretability preference (see Fig. S-14 for a mock-up of PW-1). If both systems seemed equally interpretable, they could indicate 'Equal' (EQ) as their answer. This experiment was carried out through an online survey with 40 participants with a wide range of expertise.
Table VI presents the frequency of the users' interpretability ratings for the 20 pairwise comparisons. The details of the answers given by each participant to each of the pairwise comparisons are shown in Table S-VIII. From an initial observation, there is appreciable diversity of opinion among the 40 participants as to the interpretability of the various systems. This also shows that interpretability is very subjective, because each participant may perceive interpretability differently.
TABLE VI
FREQUENCY OF THE USERS' INTERPRETABILITY RATING FOR PAIRWISE COMPARISONS OF ROTARY CRANE, AS EXTRACTED FROM USER STUDY

Pairwise       Users' Interpretability Rating
comparison     A   B   C   D   E   F   G   H   I   J   K   L  EQ
PW-1          21  11   -   -   -   -   -   -   -   -   -   -   8
PW-2          20   -  12   -   -   -   -   -   -   -   -   -   8
PW-3           -  17   8   -   -   -   -   -   -   -   -   -  15
PW-4           -  16   -  14   -   -   -   -   -   -   -   -  10
PW-5           -  16   -   -  13   -   -   -   -   -   -   -  11
PW-6           -   -   8  21   -   -   -   -   -   -   -   -  11
PW-7           -   -  10   -  19   -   -   -   -   -   -   -  11
PW-8           -   -   -   9  13   -   -   -   -   -   -   -  18
PW-9           -   -   -  13   -  25   -   -   -   -   -   -   2
PW-10          -   -   -   -  13  22   -   -   -   -   -   -   5
PW-11          -   -   -   -   -  26  11   …

2) Exploration of Alternative Configurations of H: This step was conducted to explore various alternative aggregation and layer-weighting strategies as described in Section III. Firstly, the Fuzzy index for each of the subsystems present in the twelve different Rotary crane system configurations was calculated, as shown in Table VII.
Five different aggregation strategies, namely mean, min, max, and two linguistic OWAs (using alpha (α) of 0.1 and 2), were explored. Each was used as the general aggregation operator in the H framework presented in (5); note that only the decreasing-weight layer-weighting strategy was used in conjunction with the various aggregation strategies. For example, for the case of the linguistic OWA with α = 0.1, the value α = 0.1 is used in (4) to obtain the weights (w), which are then multiplied by the subsystem indices reordered in descending order. Given that the three Fuzzy index values for the sub-systems in System F are $F_1 = 0.4932$, $F_2 = 0.1941$ and $F_3 = 0.4932$, the overall interpretability of Rotary crane system F was computed by applying (5) with the OWA (α = 0.1) aggregation and the decreasing-weight layer-weighting.
Meanwhile, in the layer-weighting experiment, the three aforementioned layer-weighting strategies (decreasing weight, increasing weight and equal weight, as described in Section III-B) were investigated.
All these strategies were used as the layer-weight l j in the H framework presented in (5)note that only the mean aggregation strategy was used in conjunction with these layer-weighting strategies. For instance, for the case of increasing weight, the values of layer-weight can be computed using (8). For the Parallel models which consists of two layers, the values of layer weights at each layer are l 1 = 0.333 and l 2 = 0.667. Meanwhile, for the Serial models which consists of three layers, the values of layer weights at each layer are l 1 = 0.167, l 2 = 0.333 and l 3 = 0.500. Given that three Fuzzy index values for each sub-system in System F are F 1 = 0.4932, F 2 = 0.1941 and F 3 = 0.4932, the overall interpretability of system F is computed using H with layer-weighting, increasing weight and aggregation strategy, mean can be expressed as follows: The results obtained are shown in Table VIII. From these results, we can see that H framework produced a diversity of answers for various systems, aggregations and layer-weighting strategies. These results were then transformed to obtain the H scores for the 20 pairwise comparisons. For the case of the H mean example in aggregation strategies, the first pairwise comparison is between System A and B. In this example, System B was chosen as it scores higher than System A based on the overall interpretability, indicating that the H framework suggests that System B is more interpretable than System A. The complete results of pairwise comparison for the interpretability of the Rotary crane systems obtained from the H with different aggregation and layer weighted strategies can be seen in Table IX. Whilst the interpretability index is a real number, nevertheless sometimes it produces identical indices for two different systems -in this case, it is labelled in the Table as 'EQ' (equal). In general, systems might be considered equal if the difference were below a certain threshold.
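The aggregation and layer-weighting steps described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation. It assumes that (4) takes the standard quantifier-guided OWA form with Q(r) = r^α, that the increasing weights of (8) are proportional to the layer number (which reproduces the values 0.333/0.667 and 0.167/0.333/0.500 quoted above), and that (5) is a layer-weighted sum of the per-layer aggregates. The function names and the assignment of System F's subsystems to layers are hypothetical and serve only as illustration.

```python
from statistics import mean


def owa_weights(n, alpha):
    """OWA weights from the quantifier Q(r) = r**alpha (assumed form of (4))."""
    return [(i / n) ** alpha - ((i - 1) / n) ** alpha for i in range(1, n + 1)]


def owa(values, alpha):
    """Multiply the weights by the values reordered in descending order."""
    ordered = sorted(values, reverse=True)
    weights = owa_weights(len(ordered), alpha)
    return sum(w * v for w, v in zip(weights, ordered))


def increasing_layer_weights(n_layers):
    """Weights proportional to the layer number: l_j = j / (1 + ... + n)."""
    total = n_layers * (n_layers + 1) / 2
    return [j / total for j in range(1, n_layers + 1)]


def h_index(layers, layer_weights, aggregate=mean):
    """Layer-weighted sum of the aggregated subsystem indices (assumed form of (5))."""
    return sum(l * aggregate(layer) for l, layer in zip(layer_weights, layers))


# Fuzzy index values quoted in the text for System F.
f1, f2, f3 = 0.4932, 0.1941, 0.4932

# Hypothetical two-layer layout, chosen only for illustration.
layers = [[f1, f2], [f3]]
weights = increasing_layer_weights(len(layers))

print([round(w, 3) for w in weights])                       # [0.333, 0.667]
print([round(w, 3) for w in increasing_layer_weights(3)])   # [0.167, 0.333, 0.5]
print(h_index(layers, weights, aggregate=mean))
print(h_index(layers, weights, aggregate=min))
print(h_index(layers, weights, aggregate=lambda v: owa(v, 0.1)))
```

Passing the aggregation operator as a parameter mirrors the way the framework treats mean, min, max and OWA as interchangeable choices, so each configuration in the experiment differs only in the `aggregate` and `layer_weights` arguments.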
3) Matching H to the Participatory Study: This step was conducted to determine the level of agreement between the interpretability ratings provided by the participatory user study (as in Subsection V-B1) and various alternative configurations of the H framework (as shown in Subsection V-B2).
Specifically, we computed the agreement scores between the results in Table IX and those in Table VI. For example, for pairwise comparison PW-1, the user preferences are A = 21, B = 11 and EQ = 8, as shown in Table VI. Since H_mean produces a higher interpretability score for B than for A, we deduce that H_mean prefers B, and consequently the agreement score obtained is 11 (as B was preferred by 11 users). Full details of the agreement scores are provided in Table X. The last two rows summarise the agreements, providing the mean and standard deviation (SD) for each column.
From Table X, it can be seen that the H_min aggregation strategy and the increasing weight layer-weighting strategy achieve the highest average agreement scores. That is, the answers given by users agree most closely with the ratings obtained using H with the min aggregation strategy and with the increasing weight layer-weighting strategy.
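The agreement computation described in this step can be illustrated with a short Python sketch. It is a sketch under stated assumptions rather than the procedure used to produce Table X: the function names, the optional tolerance parameter `tol` and the two H scores are hypothetical, with the scores chosen only so that B scores higher than A, as stated for H_mean. The vote counts are those listed for PW-1 in Table VI.

```python
def h_preference(score_a, score_b, label_a, label_b, tol=0.0):
    """Return the system preferred by H, or 'EQ' if the scores differ by at most tol."""
    if abs(score_a - score_b) <= tol:
        return "EQ"
    return label_a if score_a > score_b else label_b


def agreement(user_votes, preferred):
    """Number of users whose answer matches the preference indicated by H."""
    return user_votes.get(preferred, 0)


# PW-1 vote counts as listed in Table VI.
pw1_votes = {"A": 21, "B": 11, "EQ": 8}

# Hypothetical H scores: only their order (B above A) comes from the text.
preferred = h_preference(0.40, 0.45, "A", "B")

print(preferred, agreement(pw1_votes, preferred))  # B 11
```

With `tol=0.0` only identical indices are labelled 'EQ'. A positive tolerance implements the suggestion that two systems might be treated as equal when their indices differ by less than a threshold.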

VI. DISCUSSION
We studied the newly proposed generic H framework through a participatory design process consisting of experiments whose main aims were: (i) to explore and compare the proposed H measure with other approaches to determining the overall interpretability of hierarchical systems; and (ii) to refine the parameters of the proposed H measure.
In the first step of the first experiment, a participatory user study was conducted to assess how users perceived the interpretability of the Iris systems. From the interpretability rankings provided by users, we found that the majority indicated that the Parallel HFS was more interpretable than the flat FLS and the Serial HFS in Set MF-2 and Set MF-3 (76% and 72% of users, respectively, as shown in Tables III and IV). However, it was less clear-cut whether the flat FLS was more interpretable than the Serial HFS in Set MF-2 and Set MF-3.
Whilst, for the illustrative example, there is no clear relationship between the numerical results obtained for the Parallel and Serial HFSs, the comments of users (which can be seen in the Supplemental material available with the digital copy of this paper) indicate that the Parallel form is more suited to the example of the Iris system. According to several users, it is more intuitive when sepal and petal are classified separately, with the resulting outputs driving the classification of the Iris flowers. Similarly, the participants expressed that fewer rules in each subsystem improved readability. We do not believe that this preference is intrinsic to the Parallel or Serial form of decomposition; rather, it is related to the natural structure (petals and sepals of flowers) inherent in this particular example.

In the second step, we explored the interpretability of HFSs using the proposed H framework (H_mean) in comparison to that obtained without the framework, i.e. just using a plain average of subsystems (Mean). The results showed that while the Mean produced the same result for the Parallel and Serial HFSs (as it takes no account of the number of layers or the topology), our framework produces results that differ depending on the topology of the system. The result obtained for H_mean on the Iris system, particularly in configuration MF-2, produces a ranking that is closer to that given by users. Therefore, these observations and the current evidence indicate that our H framework (H_mean) is better than a measure without the framework at capturing a natural concept of interpretability of HFSs.

TABLE IX (excerpt)
PW-13    EQ  H   G   H   G   EQ  H   H
PW-14    I   I   I   I   I   I   I   I
PW-15    EQ  EQ  EQ  J   I   EQ  EQ  EQ
PW-16    K   K   I   K   K   K   I   I

Unfortunately, the first experiments undertaken on the Iris system did not have sufficient discriminatory power to help identify the most appropriate parameters (aggregation and layer-weighting strategies) of our framework.
A second experiment was therefore carried out to derive the configuration of the H framework using a more complex system, the Rotary crane example. Note that the variables in this example carry less semantic meaning than those in the Iris classification used in the first experiment. Nevertheless, its inherently higher complexity (six inputs) allows more possible hierarchical topologies, so the second example has higher discriminatory power for identifying the most appropriate parameters of the H framework. In the first step, we carried out another user study to assess how people perceive the interpretability of twelve different configurations of the system through 20 pairwise system comparisons. The opinions of the 40 participants, who had a range of expertise, revealed a diversity of perceptions regarding interpretability. The results imply that interpretability is very subjective and challenging to characterise, as views on the interpretability of different system topologies may vary greatly. In the second step, alternative configurations of the H framework with various aggregation and layer-weighting strategies were used to measure interpretability. Table VIII shows that this more complex system produces different interpretability scores for almost all the different configurations of the system. The final step examined the level of agreement between the pairwise comparisons produced by the aggregation and layer-weighting strategies in Step 2 and those obtained from the participatory user study in Step 1. The number of agreements between the users' views in Step 1 and the H results in Step 2 shows that the H_min aggregation strategy and the increasing weight layer-weighting strategy produced the highest agreement scores, 15 and 17 respectively, when compared with the others.
While the differences are relatively small, these results suggest using the H framework with the H_min aggregation strategy and the increasing weight layer-weighting strategy, as this configuration produced the highest agreement with the users.

TABLE X
THE AGREEMENT SCORE BETWEEN THE PREFERENCES GIVEN BY EACH OF THE USERS (IN TABLE VI) AND THE PREFERENCE INDICATED BY H FRAMEWORK (USING DIFFERENT AGGREGATION AND LAYER-WEIGHTED STRATEGIES AS SHOWN IN TABLE IX)
The proposed framework and user study raise some interesting issues which are worthy of further and more detailed study. One issue is "How is the experience of the participants measured?", with the associated question "Does it affect the results?". In our studies, we recruited a range of people, from early-stage PhD students to Full Professors with many years' experience of fuzzy systems. However, we did not formally assess their expertise. For obvious reasons, this might be a difficult matter to assess, as individuals may be reluctant to have their 'expertise' measured! Nevertheless, it would surely be interesting both to attempt to measure actual expertise in fuzzy systems (rather than just self-reported expertise) and to explore whether this affects opinions of interpretability in any way. A second issue is "Is there a correlation between interpretability and the classification results?" It has previously been reported that there is a trade-off between interpretability and accuracy [52], [53]. That is, the higher the interpretability of a given system, the lower its accuracy tends to be. Since accuracy concerns the ability of a model to make correct predictions, a similar relationship may exist between interpretability and the classification results. For instance, a model that achieves higher classification accuracy may be less interpretable. However, in this paper we do not investigate any correlation between interpretability and classification results; we focus on introducing a general framework to capture the interpretability of HFSs.
The study of interpretability, particularly in the context of hierarchical fuzzy systems, is an important area, which is likely to gain interest as it has clear relevance to explainable AI (XAI). The studies presented here show that there are sizeable differences in opinion between users as to the interpretability of various configurations of hierarchical systems, including systems with differing topologies and rule bases of various sizes.

VII. CONCLUSION
In conclusion, we have contributed a new generic framework for the measurement of the interpretability of hierarchical fuzzy systems, namely the H framework. This framework allows any index for measuring the interpretability of a flat fuzzy system to be applied to any configuration of hierarchical system, with different numbers of subsystems organised in differing topologies. We have then presented a participatory design process consisting of two main experiments, which aimed (i) to measure and compare the proposed H framework measure with others; and (ii) to determine the best strategies for combining subsystems into an overall index of interpretability. Based on the current evidence, we tentatively suggest the use of the min operator to aggregate subsystems within a layer, together with the weighted mean operator using an increasing weight strategy to combine layers, within the generic H framework for capturing the interpretability of HFSs.
Clearly, further work is also needed to explore the more general question of the wider meaning of interpretability of HFSs. Thus, in future, we expect further development of the H framework to explore other aspects of the interpretability of HFSs, including the semantic interpretability of fuzzy sets, that of intermediate outputs, and the logical complexity of the rules. In other future work, we will focus on conducting more experiments with different settings, involving several case studies with more complex and varied hierarchical systems, and on recruiting broader sets of participants from both within and outside the fuzzy community. Moreover, we will also seek to improve the agreement score, e.g. by using the Spearman rank-order correlation on real-valued scores, which may allow the differences between HFS structures to be explored in addition to the preferences indicated by the framework. In doing so, we hope to gain further insight into different configurations of the framework, in order ultimately to gain a deeper understanding of the interpretability of hierarchical fuzzy systems, captured in a general index.