Multi-Layer Combinatorial Fusion Using Cognitive Diversity

Multiple scoring systems (MSS), including rank and score functions, have been widely used in multiple regression, intelligent biometric systems, multiple artificial neural nets, combining pattern classifiers, ensemble methods, machine learning and artificial intelligence (AI), data and information fusion, preference ranking, and deep learning. Combining MSS has achieved numerous successful results in a variety of domain applications. However, the reasons why this happens remain an active area of investigation. Combinatorial fusion analysis (CFA) combines MSS using the rank-score characteristic (RSC) function and cognitive diversity (CD). The RSC function was proposed to characterise the predictive behaviour of a scoring system. It was subsequently used to define the notion of “cognitive diversity”, which measures the dissimilarity in the representation of information between two scoring systems. In this article, we first examine characterisations of and diversity between scoring systems. Then, we review combinatorial fusion analysis with a variety of domain applications, including biometric systems in cognitive neuroscience, and joint decision making with visual cognitive systems. Finally, we demonstrate that multi-layer combinatorial fusion (MCF) on the Kemeny rank space is a viable machine learning and AI framework for preference ranking and reinforcement learning. This work provides a scientific foundation and technological insights for the use of combinatorial fusion in ensemble methods, data and information fusion, preference ranking, and deep reinforcement learning, with applications to a variety of domains in data science and informatics for secure and sustainable societies.


I. INTRODUCTION
According to Jim Gray, in The Fourth Paradigm [2], the scientific discovery process has gone through three phases: (a) empirical (thousands of years ago), (b) theoretical (in the last few hundred years), and (c) modelling and simulation (beginning about a century ago). Today's scientific inquiry and knowledge discovery is not only data-intensive but also data-centric, which requires the joint effort and fusion of the three disciplines used in the previous three phases: mathematics, statistics, and computation.
The associate editor coordinating the review of this manuscript and approving it for publication was Marina Gavrilova.
The current process of scientific discovery can be characterised as an inductive, rather than deductive, problem [3]. Early scientific discovery, which was also deductive empiricism, focused upon describing homogeneous, or universal, processes, which are used to validly predict singular scores in relation to specific characteristics. One early example of this technique was Galileo's experimental construction of the arc-length of a pendulum, which led to the property of isochronism [4].
When the complexity of a system increases, the number of parameters necessary to describe the system also increases. Each parameter must be differentially weighted by the exhaustive enumeration across the set of possible points. Within the exhaustive space, scientists are able to determine the effect of X, a superset of the possible feature spaces, on Y, the target result of a function that we attempt to approximate. The expectation of performance is realised as a form of a conditional random variable. Hence, the uncertainty in predicting the state of a system is reduced as a consequence of the increased ability to predict how the system is expected to perform with the introduction of new data. Typical examples include scoring systems in biometric systems [5], [6], information retrieval [7], internet search, and search engine optimisation, as well as biomedical informatics, chemoinformatics, virtual screening and drug discovery [8], [9]. Two particular examples are the score function used by Google to determine the relevance of page presentation as preference learning [10], and the protein structure prediction and protein-ligand interaction score functions used in pharmaceutical settings [11]-[13].
VOLUME 9, 2021. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Many problems in the science, technology, engineering and mathematics (STEM) areas and social-ecosystems emphasise the relevance, closeness, or similarity between two data measurement vectors of subjects, objects, or entities. Similarity is also a well-studied problem in the field of functional approximation, allowing us to view a system in a specific, quantitative and convex framework. Methods are not initially constructed to measure the explicit probability of success or the relevance to the target prototype that a score function s A of a scoring system A was trained upon. The initial desired goal is rather to provide an optimal and/or desired ordering of the candidate data items that can serve as a proxy for preference ranking of the data items. This can be accomplished by sorting the implicit score values in descending (or ascending) order and assigning natural numbers to that rank order [14]. Our perspective depends upon the fusion of the mathematical, statistical and computational approaches, including the 'Two Cultures' as introduced by Breiman [15], wherein we treat the data models themselves as fundamental and necessary. However, each model has its strengths and weaknesses in the process of acquiring generalisable knowledge, as characterised by the maximisation of information by model fusion.
In this article, we focus on the RSC function f A , which characterises the information-based relation between the scores and their orderings (ranks) of a scoring system A. Here we present a new paradigm that exhibits an analogy between the scoring system A on the data items D = {d 1 , d 2 , · · · , d n } with s A and r A as defined, and the variable X (A), on the data set points in D with Euclidean score values s A (d i ) and the rank values r A (d i ) at the data point d i . These rank values constitute a permutation of the set of natural numbers N = {1, 2, 3, · · · , n}. Current work provides a new paradigm for machine learning and artificial intelligence (ML/AI) with respect to general purpose deep reinforcement learning on the Kemeny rank space, which allows tied rankings. A variety of diverse application domains are discussed, including data science, informatics, biomedicine and cheminformatics, data and information fusion, virtual screening and drug discovery, ensemble methods, information and cyber security, and model fusion.

II. BACKGROUND AND OUTLINE OF CURRENT WORK
When a score function s_A of a scoring system A within the candidate data space D is converted (sorted) into a rank function r_A, the potential loss of information in the reduction from a measurable scoring space to a ranking space has historically been much discussed but little acted upon. When represented as a rectangular matrix, the data space D is of order n × t, reflecting the row space where each of the n rows is a single data item, and t is the dimension of the system space. The early work on ordered similarity measures in non-parametric space, by [16], resulted in a widely applicable measurement of normed similarity, analogous to the Euclidean Pearson's r. However, the lack of a complete ordinal metric topology, especially in the presence of ties (data items d_i and d_j have the same value s_A(d_i) = s_A(d_j)), led to an inability to follow upon the early successes after the 1970s. Recently, rank aggregation, model fusion, or ensembles of model systems have been applied to many practical uses and demonstrations of utility at the cutting edge of mathematical development. In a number of data mining competitions, academic disciplines, and industrial sectors, these have earned strong accolades as a result not of the specific model construction techniques but of the ensembling techniques themselves [17]. As the available data grows in size and complexity, the need to operate upon the expanding domain of X in a computationally meaningful way has resulted in a black box perspective with respect to the local function approximations. This situation is further complicated when multiple divergent systems are combined to produce singular approximations of Y.
In what follows, Section III covers the characterisation and properties of the RSC function f_A of scoring system A in informatics and its counterpart, the cumulative distribution function (CDF) in statistics. Section IV covers correlation and diversity between scoring systems, focusing on three estimators of statistical data similarity: the Kolmogorov-Smirnov (KS) statistic, the Kullback-Leibler (KL) divergence, and the Cramér-von Mises statistic in light of the RSC function. Section IV also covers various computational information diversity measures as defined in machine learning and artificial intelligence (AI) [37]. In particular, CD(A, B), which we define to measure the dissimilarity between two scoring inputs A and B, is compared to the notion of correlation in statistics (Pearson's correlation between s_X(A) and s_X(B), and both Spearman's ρ and Kendall's τ rank correlation between r_X(A) and r_X(B)). In Section V, we review the approach of Combinatorial Fusion Analysis (CFA), which combines multiple scoring systems using the RSC function and CD. Finally, in Section VI, we discuss and demonstrate a recent advance in multi-layer combinatorial fusion (MCF) on the Kemeny rank space, as opposed to the traditional deep learning practices on the Euclidean space.

III. CHARACTERISATION OF THE SCORING SYSTEMS
Given a scoring system A with score function s A and its rank function r A , the RSC function f A , as shown in formula 1, can be derived. In addition, a scoring system A (see III-A) can be characterised w.r.t. statistical distribution and matrix (subsection III-B), permutation group (subsection III-C), and RSC function (subsection III-D).

A. SCORING SYSTEMS
An approximating model or scoring system A on items in a data set D = {d_1, d_2, ..., d_n} consists of a score function s_A : D → R, which assigns a value s_A(d_i) to the data item d_i. Given a score function s_A of the scoring system A, a rank function r_A : D → N, where N = {1, 2, 3, ..., n} = [1, n], can be derived by sorting the score values s_A(d_i), for all d_i ∈ D, into descending order. Under this interpretation, a rank-score characteristic (RSC) function f_A : N → R was defined [1] as a mapping from the ordering of the score values to the n score values, with
$$f_A(i) = (s_A \circ r_A^{-1})(i) = s_A\left(r_A^{-1}(i)\right). \tag{1}$$
The RSC function f_A of a scoring system A has three distinctive characteristics, which we formulate as the following three remarks.
Remark 1: For a scoring system A with score function s_A and its derived rank function r_A, the rank-score characteristic (RSC) function f_A : N → R takes one of two forms: (a) f_A is a strictly monotonically decreasing function, or (b) f_A is a non-increasing function. More specifically, case (a) consists of scoring systems without tied score values, while in case (b), tied score values (and hence tied rankings) are allowed.
Remark 2: Since the RSC function f_A for the scoring system A is from N → R, it possesses the following two fundamental properties: (a) The removal of the data items in D as an explicit middle transition step between the conversion of the scores to rankings leads to a function that is based on N, not on D; and (b) For two scoring systems A and B on the same domain data set D, the relationship between A and B defined using f_A and f_B is independent of the data set items d_i in D.
Remark 3: For two scoring systems A and B on the dataset items D, we define the rank combination C and the score combination D. Then, under certain conditions, including differing RSC functions f_A and f_B, rank combination C is better than score combination D [1], [38].
Here we give one example. Let D = {d_1, d_2, ..., d_n} be a set of n data items and N = [1, n] be the set of all integers from 1 to n. Let A and B be two scoring systems with score functions s_A and s_B, and rank functions r_A and r_B, respectively. Let f_A and f_B be the rank-score characteristic (RSC) functions of A and B, respectively (see Tables 2 and 3). The RSC function f_A, obtained by formula 1, characterises the scoring system A. It provides the relationship between the ranks and the score values of the scoring system A (see Figure 2 for the RSC function graphs of f_A and f_B). In what follows, we review three other types of characterisation: the cumulative distribution function (CDF), the score matrix, and the permutation.

B. CDF & SCORE MATRIX
Treating the normalised score function s_A of a scoring system A as a random variable X_A, we can construct a cumulative distribution function such that
$$F_A(x) = P(X_A \le x). \tag{2}$$
Figure 1 depicts the CDF graphs of X_A and X_B. There are several different ways of using matrices to characterise the scoring system A. First, the traditional Kendall's τ_a matrix form applies to a scoring system A without ties. Then, the extended Kendall's τ_b matrix representation can handle ties [39]. In their study of consensus ranking problems, Emond and Mason [40] used the Banach inner-product nature of the Kemeny rank space τ_x, the same as d_H in Section VI, to characterise the scoring system A. This characterisation is similar to the description and proofs of [41], utilising the Borda count for Mixed Group Ranks fusion, with the combination function tending towards the expectation of the group, defined as the median of the set of scoring functions. The Borda count is, however, not Condorcet-compliant: it is expected, but not guaranteed, that scoring systems rank candidates as a linear combination that produces a one-to-one correspondence between the highest composite scores and the highest rankings. The Borda count fails to satisfy this criterion under certain conditions, wherein a lower-scoring candidate combination may be ranked above a higher-scoring one. Our utilisation of the Kemeny metric (see Section VI), which is linearly invariant under monotonic transformation, allows our combinations to reduce the potential for Condorcet failures occurring in the presence of ties upon scoring systems.
FIGURE 2. RSC function graphs for scoring systems A and B in Table 3.
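The matrix characterisations above can be illustrated with a short sketch. It assumes the conventions usually attributed to Emond and Mason: an off-diagonal entry is 1 when item i is ranked ahead of item j and -1 when it is ranked behind, while a tie is coded 0 in the τ_b-style matrix and 1 in the τ_x-style matrix. The function names `score_matrix` and `tau_x` are hypothetical, and the article's own equations may differ in notation.

```python
def score_matrix(ranks, kind="x"):
    """Pairwise score matrix of a ranking (rank 1 = best, ties allowed).

    kind="b": ties are coded 0 (Kendall tau_b style, an assumed convention).
    kind="x": ties are coded 1 (Emond-Mason tau_x style, an assumed convention).
    """
    n = len(ranks)
    m = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # diagonal stays 0
            if ranks[i] < ranks[j]:
                m[i][j] = 1        # i ranked ahead of j
            elif ranks[i] > ranks[j]:
                m[i][j] = -1       # i ranked behind j
            else:
                m[i][j] = 1 if kind == "x" else 0  # tie
    return m


def tau_x(ranks_a, ranks_b):
    """Normalised inner product of the two tau_x score matrices."""
    n = len(ranks_a)
    a = score_matrix(ranks_a, "x")
    b = score_matrix(ranks_b, "x")
    total = sum(a[i][j] * b[i][j] for i in range(n) for j in range(n))
    return total / (n * (n - 1))


tied = [1, 1, 2]  # hypothetical ranking with one tie
b_matrix = score_matrix(tied, "b")
# Self inner product of the tau_b-style matrix, same normalisation as tau_x.
b_self = sum(v * v for row in b_matrix for v in row) / (3 * 2)
```

Under these assumptions `tau_x(tied, tied)` equals 1, whereas the same normalisation applied to the τ_b-style matrix gives `b_self` below 1, which is one way to see why tie handling matters for a rank-space metric.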

C. PERMUTATIONS
A rank function is really a permutation, with potential ties, of the data. The permutation unit distance under the Kendall metric, defined upon S_n, provides a combinatorial and computational framework in which distances are computed as the number of swaps necessary to recover the target ordering [14], [42]. As the score matrices (equations 5 and 7), rather than the scores themselves, are the comparative basis upon which all unit evaluations (i.e., errors in prediction) are computed, the inclusion of a proper metric analysis, with ties, sub-additively decomposes the bias realised in [42]. This, in turn, enables the construction of a linear function of the cumulative distribution function across the ranks of all m units that is optimal in the Gauss-Markov sense. Thus, independent of the actual construction of the target score approximation, which is orderable, the permutation measure determines a compact and comparable measure space.
For example, consider Figure 3 and Figure 4, which depict the permutation ranking of n = 4 elements upon the complete symmetric permutation space. This restricts the score realisations to be monotonically invariant, such that the lowest rank denotes the highest score realisation upon the symmetric group space S_4, with adjacency defined as a swap of two adjacent elements. In general, S_n with the Kendall's τ metric is called a bubble-sort graph (B_n) [43]. In the graph B_n, no tied rankings may occur. Figure 3 shows that B_4 can be constructed using 4 copies of B_3. Figure 4 depicts the layout of the 4! = 24 nodes w.r.t. the distance from the identity permutation I_n = 1234 or from its reversal 4321.
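As a small illustration of the bubble-sort graph described above, the following sketch builds B_4 explicitly and measures graph distance from the identity permutation by breadth-first search. The helper names `bubble_sort_graph` and `distances_from` are hypothetical and chosen only for this example.

```python
from collections import deque
from itertools import permutations


def bubble_sort_graph(n):
    """Adjacency lists of B_n: nodes are permutations of 1..n,
    edges join permutations that differ by one adjacent swap."""
    graph = {}
    for p in permutations(range(1, n + 1)):
        neighbours = []
        for i in range(n - 1):
            q = list(p)
            q[i], q[i + 1] = q[i + 1], q[i]  # swap positions i and i+1
            neighbours.append(tuple(q))
        graph[p] = neighbours
    return graph


def distances_from(graph, source):
    """Breadth-first search distances from source to every node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


B4 = bubble_sort_graph(4)
dist = distances_from(B4, (1, 2, 3, 4))
# Number of permutations at each distance 0..6 from the identity.
layers = [list(dist.values()).count(k) for k in range(7)]
```

Here `B4` has 24 nodes of degree 3, and the reversal (4, 3, 2, 1) is the unique node farthest from the identity.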

D. RSC FUNCTION
The rank-score characteristic (RSC) function, as defined by [1], is the function f_A : N → R that assigns the score value of a data item to the rank of that data item under the scoring system A. In mathematical terms, it is the composite function of s_A and r_A^{-1}, constructed in the following way:
$$f_A(i) = (s_A \circ r_A^{-1})(i) = s_A\left(r_A^{-1}(i)\right).$$
As discussed earlier in this section, the RSC function of a scoring system A characterises A in a way analogous to the roles played by the cumulative distribution function (CDF) and the score matrices (M_τa, M_τb, M_τx in Section III-B), which characterise Euclidean spaces with parametric Euclidean score values and non-parametric Kemeny rank values of the scoring variable X_A, respectively, in computation.
We note that given the score function s A of a scoring system A, the RSC function f A can be computed efficiently [38], [44]. More specifically, f A is obtained by sorting the score values {s A (d i )|d i ∈ D} using the rank value [1, n] = {1, 2, 3, · · · , n}, with n = |D| as the key.
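A minimal sketch of this computation follows. The function names and the sample scores are hypothetical, and ties are broken by input order purely for simplicity, which is an assumption of this example rather than a property of the RSC function.

```python
def rank_function(scores):
    """Ranks of the items (1 = highest score); ties broken by input order."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for position, item in enumerate(order, start=1):
        ranks[item] = position
    return ranks


def rsc_function(scores):
    """RSC function as a list: entry i-1 holds the score of the item ranked i,
    that is f(i) = s(r^{-1}(i))."""
    ranks = rank_function(scores)
    f = [0.0] * len(scores)
    for item, rank in enumerate(ranks):
        f[rank - 1] = scores[item]
    return f


scores_A = [0.3, 0.9, 0.5, 0.7]    # hypothetical score values s_A(d_i)
ranks_A = rank_function(scores_A)  # r_A(d_i) for each item
f_A = rsc_function(scores_A)       # scores listed by rank 1..n
```

The sort dominates the cost, so the sketch runs in O(n log n) time. Note that `f_A` no longer refers to the data items themselves, only to ranks and score values.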
IV. CORRELATION AND DIVERSITY BETWEEN SCORING SYSTEMS
In informatics, diversity has been studied in combining artificial neural networks [58], combining pattern classifiers [59], complex systems and ensemble methods [47], [60], [61], and combining multiple scoring systems [1], [18]. We illustrate more details in the following two subsections: statistical data correlation and computational information diversity, in the context of a pair of scoring systems A and B.

A. STATISTICAL DATA CORRELATION
For two scoring systems A and B, statistical correlation focuses on the correlation between the data distributions of the random variables X_A and X_B. In this regard, we begin with the empirical Euclidean definition of distance for the Pearson correlation (equation 8) and compare it to the Spearman data correlation (equation 9), which is defined through the squared difference between the rankings of the two scoring systems upon each of the n observations, in addition to the previous metrics in equations 3, 4, and 6. When performing statistical tests within the rank space, the Kolmogorov-Smirnov (KS) statistic [53], [54] is an easily understood measure. In the following discussion of statistical methods, we note the equivalence between x = … In the context of the RSC function for each of the scoring systems A and B [1], the KS statistic computes the supremum of the differences between the two ranks f_A^{-1}(x) and f_B^{-1}(x) over each score value x ∈ (0, 1), where F_A and F_B are the cumulative distribution functions of X_A and X_B (equation 2 and Figure 1), and f_A and f_B are the RSC functions of A and B (equation 1 and Figure 2). Importantly, while the score functions may be non-linearly related to the feature spaces of interest, rankings are linearly monotonically invariant function spaces [62], allowing more fundamental distributional characteristics to be defined without the embedding complexity otherwise necessary upon the Euclidean plane. The Kullback and Leibler divergence [55] can also be understood in the context of the RSC function.
The expression measures the divergence from the function approximation distribution X_B to the target X_A, weighted by the potential difference in score magnitude that follows from choosing the Euclidean parametric formulation of B under A, where F_A and F_B are the cumulative distribution functions of X_A and X_B. The Cramér-von Mises criterion [57] operates upon the same domain as the KS statistic, supplanting the ℓ∞-norm with the ℓ2-norm, with the same defined domain and range as the RSC function. From this it can easily be seen that the Cramér-von Mises statistic is equivalent to the total marginalised distance between the two scores under both functions over all data items in the sample, thus representing the total dissimilarity between a sequence of n scores for any empirical finite sample. The term (2i − 1)/(2n), in ascending order with respect to i, reflects an intrinsic ordering which is present, as proven before, within any cumulative distribution function. The ratio 2i/(2n) − 1/(2n) allows the assigned probability of comparison to be a strictly increasing value that stays within the support (0, 1], purely as a function of n, the sample size. This direct orderability of both allows a comparison to be made across common points of the area between the two curves, which in turn allows us to determine whether the proportion of the distribution function indexed upon the rank space that is less than a given threshold is roughly equivalent to the theoretical Euclidean object of comparison. It also allows the first term to be treated as the rank score function, upon which the rank with respect to i and the aggregated score 1 − (2i − 1)/(2n) up to point i possess a proportional equivalence. In the left of Table 1(a), we provide a number of different score based statistical data aggregations marginalising over n, which allow the predictive similarity of bivariate scoring systems to be compared.
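To make the two norms concrete, here is a brief sketch using the standard textbook forms: the two-sample KS statistic as the largest gap between two empirical CDFs, and the one-sample Cramér-von Mises statistic built on the (2i − 1)/(2n) term. These are assumed to correspond to, but are not copied from, the article's own equations, and all function names and sample values are hypothetical.

```python
def ecdf(sample):
    """Empirical cumulative distribution function of a sample."""
    data = sorted(sample)
    n = len(data)

    def F(x):
        # Fraction of sample values less than or equal to x.
        return sum(1 for v in data if v <= x) / n

    return F


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: sup-norm gap between the empirical CDFs.
    For step functions the supremum is attained at a sample point."""
    F_a, F_b = ecdf(sample_a), ecdf(sample_b)
    points = sorted(set(sample_a) | set(sample_b))
    return max(abs(F_a(x) - F_b(x)) for x in points)


def cramer_von_mises(sample, cdf):
    """One-sample Cramer-von Mises statistic against a reference CDF:
    1/(12n) + sum over i of ((2i - 1)/(2n) - cdf(x_(i)))^2."""
    data = sorted(sample)
    n = len(data)
    total = sum(
        ((2 * i - 1) / (2 * n) - cdf(x)) ** 2
        for i, x in enumerate(data, start=1)
    )
    return 1 / (12 * n) + total


def uniform_cdf(x):
    """CDF of the uniform distribution on [0, 1], used as a reference."""
    return min(max(x, 0.0), 1.0)


ks_far = ks_statistic([0.1, 0.2], [0.8, 0.9])         # disjoint samples
cvm_even = cramer_von_mises([0.25, 0.75], uniform_cdf)  # evenly spread sample
```

The KS statistic reacts only to the single largest gap, while the Cramér-von Mises statistic accumulates squared gaps over every ordered observation.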

B. COMPUTATIONAL INFORMATION DIVERSITY
In informatics and in computational science, scoring systems (with score functions and/or rank functions) have been used extensively. In data mining, machine learning, and data and information fusion, the notion of “diversity” between two scoring systems has been widely discussed and used in the context of artificial neural nets [58], pattern classifiers [59], complex systems [61], ensemble methods [47], and multiple scoring systems [1], [63].
In [59], diversity in classification ensembling was compared to diversity in biology, software engineering, and statistical measures of relationship. Kuncheva reviewed pairwise measures and six non-pairwise measures and discussed the relationship between diversity and accuracy in the context of combining pattern classifiers. Reference [19] showed shared relationships between various diversity measures in multiple classifier systems, and [63], [64] established a diversity-performance relationship for majority voting and plurality voting in classifier ensembles.
Diversity in ensemble methods [47] among the individual learners has been considered as a fundamental issue and is crucial to the accuracy of the ensemble. Although many diversity measures have been proposed and developed in the field of ensemble methods, the right formulation and measures for diversity have not been resolved yet [37], [47], [59], [65], [66].
In the field of complex systems, measuring diversity exists in three categories: diversity within a type, such as variates, diversity across types, such as entropy and attributes, and diversity of community composition such as population [60], [61]. Complex systems exist in a variety of domains, including ecological systems, economic systems, financial systems, political systems, and biological systems. In the right of Table 1(b), we provide a number of different approaches to computational information diversity, which allow for the predictive similarity of bivariate scoring systems to be compared.

C. COGNITIVE DIVERSITY
CD was proposed to measure the dissimilarity between two scoring systems A and B [1], [18]. More specifically, it is calculated using the RSC functions f_A and f_B of A and B [67]. For example, the cognitive diversity between A and B, CD(A, B), is computed as in formula 14 [8], and is used as depicted in Figure 5 to show the relationships between A and B [67]. It was shown [38] that under certain conditions, rank combination performs better than score combination. The conditions under which the better-performing combinations are obtained favour a larger cognitive diversity between A and B. More details on the notion of cognitive diversity and its domain applications can be found in [67].
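The following sketch shows one way such a quantity could be computed. It assumes a root-mean-square difference between the two RSC functions after min-max normalisation of the scores; this form, the division by n, and the function names are assumptions made for illustration and are not taken from formula 14.

```python
import math


def normalise(scores):
    """Min-max scaling to [0, 1]; assumes the scores are not all equal."""
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def rsc(scores):
    """RSC function as a list: scores in descending order, indexed by rank."""
    return sorted(scores, reverse=True)


def cognitive_diversity(scores_a, scores_b):
    """Assumed form: root-mean-square gap between the two RSC functions."""
    f_a = rsc(normalise(scores_a))
    f_b = rsc(normalise(scores_b))
    n = len(f_a)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(f_a, f_b)) / n)


cd_same_shape = cognitive_diversity([1, 2, 3], [10, 20, 30])
cd_example = cognitive_diversity([0, 0.5, 1], [0, 0, 1])
cd_shuffled = cognitive_diversity([0.5, 1, 0], [1, 0, 0])
```

Because only the RSC functions enter the computation, shuffling which item receives which score leaves the value unchanged, so `cd_example` and `cd_shuffled` agree.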

D. EXAMPLES
Consider two search algorithms in an information retrieval system that use the similarity scoring systems A and B, respectively. Let D be the set of documents D = {d_1, d_2, ..., d_n} (see Tables 2 and 3). For the pair of scoring systems A and B as illustrated in Table 2, we list the diversity of A and B, d(A, B), in the context of the fifteen measurements related to this section. Table 1(b) consists of seven computational information diversities [40], [46], [48], [50]-[52], [56], [65], [68]-[70]. For each measure, we also indicate the range of measurement and its reference in the table.

V. COMBINATORIAL FUSION ANALYSIS
In this section, we review the field of Combinatorial Fusion Analysis (CFA) proposed by [1], [18] for optimally combining multiple scoring systems. CFA was founded upon the concept of using the “rank-score characteristic (RSC) function” to characterise a predictive engine, and the concept of “cognitive diversity” was introduced to measure the dissimilarity between multiple scoring systems [1], [18], [67]. One of the novelties of the CFA approach is the use of various combinatorial methods, including both rank and score combination [18], [67], to produce predictive model combinations whose predictions are both more accurate and more robust to cross-validation [30].
Combining multiple scoring systems (MSS) has become an emerging field of machine learning, AI, forecasting, and predictive analytics. It has been shown to be useful in a variety of domains, such as combining pattern classifiers [59], weighted score and rank combination [18], [67], ensemble methods [47], data fusion in information retrieval [7], multi-indicator systems [70], and online learning algorithms [71].
Given t scoring systems A_j, j = 1, 2, ..., t, with score functions s_{A_j} and rank functions r_{A_j}, respectively, we have either score combination (SC) or rank combination (RC). Letting SC = SC(A*_j) and RC = RC(A*_j) be the score combination and rank combination of A*_j = {A_1, A_2, ..., A_t}, each combination has its own score function. The performance of these score functions is evaluated using the permutation distance between the target permutations and the model combination permutations, by means of the Kemeny metric (see Section VI). The Kemeny metric resolves the biased performance of the Kendall's τ_b distance as determined in [42], [43], allowing a performance evaluation and probability distribution to be determined for any monotonically non-decreasing cumulative distribution of scores and score combinations. The combination methods, numbered (1) to (6), include (2) weighted combination by performance (WCP), (3) weighted combination by diversity strength (WCDS), (4) geometric mean, (5) mixed group rank, and/or (6) other combination methods, on the bubble-sort Cayley graph B_n and the Euclidean space E^n = R^n for rank combination and score combination, respectively. Each of the methods (1)-(4) has both rank and score combinations, while method (5) has rank combination only.
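As an illustration of score and rank combination, the sketch below uses a plain unweighted average of the t score functions and of the t rank functions. The choice of averaging, the function names, and the sample scores are assumptions made for this example; the weighted variants listed above are not shown.

```python
def rank_function(scores):
    """Ranks of the items (1 = highest score); ties broken by input order."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for position, item in enumerate(order, start=1):
        ranks[item] = position
    return ranks


def score_combination(systems):
    """Average of the score functions, item by item (assumed unweighted)."""
    t = len(systems)
    n = len(systems[0])
    return [sum(s[i] for s in systems) / t for i in range(n)]


def rank_combination(systems):
    """Average of the derived rank functions, item by item (assumed unweighted)."""
    return score_combination([rank_function(s) for s in systems])


s_A = [8, 6, 5, 1]  # hypothetical scores of system A on items d_1..d_4
s_B = [2, 9, 3, 1]  # hypothetical scores of system B on the same items

sc = score_combination([s_A, s_B])  # higher is better
rc = rank_combination([s_A, s_B])   # lower is better

# Final orderings: rank the SC values directly, and the RC values after
# negation, because a smaller average rank means a better item.
final_by_score = rank_function(sc)
final_by_rank = rank_function([-r for r in rc])
```

In this small example both combinations put the second item first, but in general the two orderings can differ, which is what makes the choice between rank and score combination meaningful.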
Combinatorial fusion operates on scoring systems. A scoring system A on the domain set D = {d_1, d_2, ..., d_n} consists of a score function s_A : D → R, which assigns a score (a real number in R) to each item d_i of the domain set D, and a rank function r_A : D → N, which assigns a rank (a natural number in N = [1, n], n = |D|) to each item. The rank function r_A(d_i) is the result of sorting the score values s_A(d_i) in descending order and assigning a rank to each data item accordingly. In Subsection V-B, we illustrate more details using two domain examples from cognitive neuroscience: preference detection based on eye movement in V-B1 [30], and joint decision making using visual cognitive systems in V-B2 [28].

B. DOMAIN EXAMPLES
We first illustrate the above discussion with an example in the information retrieval domain before we get to the two examples in V-B1 and V-B2. Let D = {d_i} be the domain space where i ∈ [1, n], n = 20. Let A, B, C be three retrieval systems with score functions s_A, s_B, and s_C and rank functions r_A, r_B and r_C, respectively. Table 4 lists the score functions and rank functions. Table 5 gives the list of normalised score functions s_A, s_B, s_C, along with their respective ranks, while Table 6 lists the RSC functions f_A, f_B and f_C, respectively. Figure 7 plots the RSC function graphs for f_A, f_B, f_C on the same space. Figure 6 illustrates the performance of all combinations by rank or by score.

1) PREFERENCE DETECTION BASED ON EYE MOVEMENT
Cognitive neuroscience researchers are similarly often interested in modelling and understanding decision processes, which are reflected in neurological electrical activation and eye movement capture, for contexts such as text comprehension. In eye movement tracking [30], the 'gaze cascade' effect was observed, in which a subject is given two images and prompted to select their preference. The subject's variance in focus is initially evenly distributed, but the gaze focus is found to follow a Pareto distribution and converges as time increases, for which the focus modality corresponds to the preferred image. In the eye tracking trials, the x,y coordinates of the subjects' gaze locations, along with the duration of the gaze at the point, are collected. Based on this data, the following features were constructed: duration of focus in the last 200ms of a trial (A), total duration (B), gaze point count (C), interest sustainability (D), and count of edges between face regions (E).
Preliminary analysis found that, with 90% accuracy, the selection of the individual face deemed most attractive by the subject was determined by the visual focus in the final 200 milliseconds, which is represented by feature A. From this observation followed the construction of the five features recorded for each of 720 trials T = {t_1, t_2, ..., t_720} (see features A, B, C, D, and E in Figure 8). Univariate score combinations and pairwise score combinations for the (5 choose 2) = 10 pairs of features are recorded in Table 7, with precision recorded in the left-most column for each approach and the corresponding RSC functions shown in Figure 8.

2) JOINT DECISION MAKING USING VISUAL COGNITIVE SYSTEMS
In this section, we revisit the task of combining multiple scoring systems in exploring the generalisable computational intelligence that arises as a consequence of CD. It has been observed that non-redundancy (i.e., dissimilarity) in well-performing scoring systems tends to improve overall system performance. This can be mathematically expressed by an algebraic distance formula. For example, the information of the two systems A and B, presented as a posterior probability distribution of the scores together with the corresponding integration over said space (thereby defining both the probability distribution function (PDF) and the cumulative distribution function (CDF)), is compared and combined. Then the redundant learned structure, or overlapping information, is removed from each system, producing a measure of the distribution of distinct information contained in these two systems, as reflected in the differences in their respective predictions.
In joint decision making problems, a decision maker is often presented with various expert decisions and must choose a subset of the recommendations to combine in order to make a decision. In a real case experiment, the authors [28] selected two candidates, who observed a small projectile being tossed onto an open grassy spot. The observers were asked to indicate the location at which the token landed, and to provide their confidence in their decision (as measured by the size of the radius about their chosen location within which they were most confident the token lay). For 96 distinct combinations of design factors, 34 cases performed as well as or better than the best decision of the two subjects. The results of this experiment are provided in Table 8, with the different weighting systems assigned to the confidence radius referred to as M_0, M_1 and M_2.

VI. MULTI-LAYER COMBINATORIAL FUSION ON THE KEMENY RANK SPACE

A. RANKING UPON THE BUBBLE-SORT CAYLEY GRAPH $B_n$
Each of the scoring systems A and B on data items $d_i \in D$ is represented by a score function ($s_A$ and $s_B$, respectively) and a rank function ($r_A$ and $r_B$, respectively). The rank function $r_A$ (or $r_B$), $r_A : D \to \mathbb{N}$ with $r_A(d_i) \in [1, n]$, gives the rank order of the data items $d_i$ in $D$. In other words, the rank function can be considered as a permutation of the data items $[i = 1, 2, \cdots, n]$. As such, $r_A$ is an element of the symmetric group $S_n$ on $n$ elements, which consists of all permutations of the elements $[1, 2, \cdots, n]$. Let $T_i$ be the operation which swaps the adjacent pair $(i, i + 1)$, for $i = 1, \cdots, n - 1$. The symmetric group $S_n$ together with the operations $T_i$, $i = 1, 2, \cdots, n - 1$, forms a graph called the bubble-sort Cayley graph, $(S_n; T_i, i = 1, 2, \cdots, n - 1)$, where two nodes (permutations) A and B are adjacent to each other if and only if $A \circ T_i = B$ or $T_i \circ B = A$ for some $i \in [1, n - 1]$. It is straightforward to show that the distance between two nodes A and B is the minimum number of adjacent interchanges needed to transform A into B.
VOLUME 9, 2021
TABLE 7. Performance of feature scoring system combinations [30].
The RSC function of a scoring system A builds upon the algebraic representation of the CDF, providing a comparative basis across different fields reflecting the range of the learning functions in which we are interested. RSC functions that are both ordered and scored identically are, unsurprisingly, identical functions. For two rank functions $r_A$ and $r_B$, viewed as two permutation arrays of scoring systems A and B on the $n$ elements $[1, n] = \{1, 2, 3, \cdots, n\}$, the following are equivalent:
1) dist(A, B), the number of adjacent swaps between systems A and B;
2) Kendall's $\tau_a$ defined in Section III-B, formula 3;
3) the number of interchanges between adjacent elements from array A to array B, by using the bubble-sort algorithm.
Although statements (1) and (2) y). Statement (3) indicates that the distance between node (permutation) A and node (permutation) B can be computed by sorting the array A to become B and counting the number of interchanges between adjacent elements.
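As an illustration of statements (1) and (3), the following minimal Python sketch counts the adjacent interchanges performed by bubble sort when transforming array A into array B, and compares that count with the number of item pairs that the two arrays order differently. The function names are hypothetical, the arrays are assumed to list the items from first to last, and `kendall_tau_a` uses the standard textbook form $\tau_a = 1 - 4D/(n(n-1))$ with $D$ discordant pairs, which is assumed (not verified) to agree with formula 3.

```python
from itertools import combinations


def bubble_sort_swaps(a, b):
    """Count the adjacent interchanges bubble sort uses to turn array a into array b."""
    # Relabel the items so that b becomes the sorted order 0, 1, ..., n-1.
    position_in_b = {item: idx for idx, item in enumerate(b)}
    seq = [position_in_b[item] for item in a]
    swaps = 0
    for end in range(len(seq) - 1, 0, -1):
        for i in range(end):
            if seq[i] > seq[i + 1]:
                seq[i], seq[i + 1] = seq[i + 1], seq[i]
                swaps += 1
    return swaps


def discordant_pairs(a, b):
    """Count the item pairs that a and b place in opposite relative order."""
    pos_a = {item: idx for idx, item in enumerate(a)}
    pos_b = {item: idx for idx, item in enumerate(b)}
    return sum(
        1
        for x, y in combinations(a, 2)
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0
    )


def kendall_tau_a(a, b):
    """Kendall's tau_a in its standard form (assumed to match formula 3)."""
    n = len(a)
    return 1 - 4 * discordant_pairs(a, b) / (n * (n - 1))


A = [3, 1, 4, 2]
B = [1, 2, 3, 4]
print(bubble_sort_swaps(A, B), discordant_pairs(A, B), kendall_tau_a(A, B))
```

Both counts coincide because each adjacent interchange made by bubble sort removes exactly one discordant pair.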
The embedding of a metric space upon a graph allows for the visual depiction of relations between measurements and predictions independent of the specific nature of the relations. This graphical structure allows for a vast array of computational applications to be undertaken [72]-[74]. The domain of knowledge and information representation [75]-[77] allows for the network structure of the relations between nodes in the graph to exist without a necessary linear parametric score relation.
When performing either rank combination or score combination, the resulting scoring system may have tied scores among different data items. Hence the resulting rank function of the combination is not a permutation and is therefore not in the bubble-sort Cayley graph space $G = (S_n; T_i, i \in [1, n - 1])$. In this regard, Kemeny and Snell [78] formalised the metric $d_K$ as the distance between rank functions A and B using the $\tau_b$ characteristic score matrix (formula 4 in Section III-B) as follows:

$$d_K(A, B) = \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \left| [M_{\tau_b}(A)]_{ij} - [M_{\tau_b}(B)]_{ij} \right| \qquad (17)$$

where the characteristic score matrix $M_{\tau_b}(A)$ is as defined in formula 4. It was shown by [78] that this metric in formula 17 satisfies a list of axioms which lead to a metric space $H_n$ that includes rank functions of the scoring systems with ties. However, since this computation involves absolute values of differences, the application of this metric is hindered in contexts for which large data sets are conventional [40].
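The Kemeny-Snell distance between two rank functions with ties can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: a rank order is given as a sequence whose entry `i` is the rank of item `i` (tied items share a rank), and the score matrix is assumed to take the usual Kemeny-Snell form (entry $+1$ if item $i$ is ranked ahead of item $j$, $-1$ if behind, and $0$ if tied), which is not verified against formula 4. The function names are hypothetical.

```python
def score_matrix(ranks):
    """Assumed Kemeny-Snell score matrix: +1 ahead, -1 behind, 0 tied (or i == j)."""
    n = len(ranks)
    return [
        [(ranks[i] < ranks[j]) - (ranks[i] > ranks[j]) for j in range(n)]
        for i in range(n)
    ]


def kemeny_snell_distance(ranks_a, ranks_b):
    """Half the sum of absolute entry-wise differences of the two score matrices."""
    ma, mb = score_matrix(ranks_a), score_matrix(ranks_b)
    n = len(ranks_a)
    total = sum(abs(ma[i][j] - mb[i][j]) for i in range(n) for j in range(n))
    return total // 2  # the total is always even because the matrices are antisymmetric


identity = (1, 2, 3)
print(kemeny_snell_distance(identity, (1, 1, 3)))  # items 0 and 1 become tied
print(kemeny_snell_distance(identity, (2, 1, 3)))  # items 0 and 1 are swapped
print(kemeny_snell_distance(identity, (3, 2, 1)))  # full reversal
```

Under these assumptions, moving two adjacent items into a tie costs one unit, and a full adjacent swap costs two units.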

B. THE KEMENY RANK SPACE $H_n$
Emond and Mason [40] proposed formula 6 to characterise a scoring system A and then constructed an alternative representation of the correlation $\tau_x(A, B)$, as found in formula 6 with the score matrix in formula 7. They also showed, by defining the concept of 'half-flip', that the Kemeny rank space $H_n$ with $\tau_x(A, B)$ as the metric satisfies the axioms defined by [78], by leveraging the Banach inner-product nature of the ultrametric space. Figure 9 depicts the Kemeny rank space $H_3$ with the half-flip as a unit of distance [78], [79]. We note that $H_3$, with 13 nodes and 18 edges, is an extension or a superimposed version of the graph $S_3$, which has 6 nodes and 6 edges. We further note that the number of nodes $p$ in $H_n$ is as follows [80]:

$$p(H_n) = \sum_{b=1}^{n} b! \left\{ {n \atop b} \right\}$$

where $\left\{ {n \atop b} \right\}$ is the Stirling number of the second kind, which is the same as the number of ways to partition a set of $n$ objects into $b$ non-empty subsets. Table 9 lists the numbers $p(H_n)$ for $n = 2, \cdots, 10$.
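The node counts can be reproduced with a short Python sketch. It assumes that the nodes of $H_n$ are exactly the rank orders of $n$ items with ties allowed, so that $p(H_n)$ is the sum over $b$ of $b!$ times the Stirling number of the second kind (the ordered Bell number), which is consistent with the 13 nodes of $H_3$ noted above. The function names are hypothetical.

```python
from math import factorial


def stirling2(n, b):
    """Stirling number of the second kind, via S(i, k) = k*S(i-1, k) + S(i-1, k-1)."""
    table = [[0] * (b + 1) for _ in range(n + 1)]
    table[0][0] = 1
    for i in range(1, n + 1):
        for k in range(1, min(i, b) + 1):
            table[i][k] = k * table[i - 1][k] + table[i - 1][k - 1]
    return table[n][b]


def kemeny_node_count(n):
    """Number of rank orders with ties on n items: sum over b of b! * S(n, b)."""
    return sum(factorial(b) * stirling2(n, b) for b in range(1, n + 1))


for n in range(2, 11):
    print(n, kemeny_node_count(n))
```

Each term counts the rank orders with exactly $b$ distinct rank levels: the items are partitioned into $b$ tied groups, and the groups are then ordered in $b!$ ways.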

C. MULTI-LAYER COMBINATORIAL FUSION (MCF) - THE ER-ALGORITHM
With the Kemeny rank space $H_n$ and the metric $d_H = \tau_x(A, B)$ in place, we aim to provide an evolutionary algorithm on $H_n$ using multi-layer combinatorial fusion. We focus first on an expansion-reduction (ER) computational algorithm, which consists of three steps.
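For concreteness, the correlation $\tau_x$ underlying $d_H$ can be sketched as follows. This is a sketch under stated assumptions rather than the authors' implementation: the score matrix is assumed to follow Emond and Mason's convention (entry $+1$ if item $i$ is ranked ahead of or tied with item $j$, $-1$ if behind, and $0$ on the diagonal), and the conversion to a distance assumes the relation $\tau_x = 1 - 2d/(n(n-1))$ with $d$ measured in half-flips. Neither is verified against formulas 6 and 7, and the function names are hypothetical.

```python
def tau_x_matrix(ranks):
    """Assumed tau_x score matrix: +1 if ahead of or tied, -1 if behind, 0 on the diagonal."""
    n = len(ranks)
    return [
        [0 if i == j else (1 if ranks[i] <= ranks[j] else -1) for j in range(n)]
        for i in range(n)
    ]


def tau_x(ranks_a, ranks_b):
    """Inner product of the two score matrices, scaled to lie in [-1, 1]."""
    n = len(ranks_a)
    ma, mb = tau_x_matrix(ranks_a), tau_x_matrix(ranks_b)
    total = sum(ma[i][j] * mb[i][j] for i in range(n) for j in range(n))
    return total / (n * (n - 1))


def half_flip_distance(ranks_a, ranks_b):
    """Distance in half-flips recovered from tau_x (assumed relation, see lead-in)."""
    n = len(ranks_a)
    return n * (n - 1) / 2 * (1 - tau_x(ranks_a, ranks_b))


identity = (1, 2, 3)
for other in [(1, 2, 3), (1, 1, 3), (2, 1, 3), (3, 2, 1)]:
    print(other, tau_x(identity, other), half_flip_distance(identity, other))
```

The appeal of this form is that it needs only products and sums of matrix entries, with no absolute values.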
The CLEAR($M$; $D$; $n$, $t$, $l$) Framework: The framework uses multi-layer combinatorial fusion (MCF) to conduct the expansion-reduction (ER)-algorithm on the Kemeny rank space $H_n$:
1) Expansion:
   a) Generate all the $2^t - t - 1$ combinations (2-com, 3-com, ..., $t$-com) using a variety of combination methods and diversity measures (e.g., methods of combination $M = \{1, 2, 3, 4\}$ and diversity measurement $D = \{d_H, CD\}$ in Section V-A).
   b) Generate the mixed group rank combination [41].
   c) Each of the four methods of combination uses both weighted combination by performance and geometric mean combination [81]. In total, $8(2^t - t - 1)$ new scoring systems are obtained.
2) Reduction: Pick the top $q$ rank orders, $0 \le q \le 2t$, which are better than the $t$ rank orders in the previous step, from the $8(2^t - t - 1) + 1$ rank orders obtained in Step 1.
   a) If $q = 0$, stop; if $0 < q < t$, go to Step 1; if $t \le q$, go to Step 2.b.
   b) Calculate diversity, using either $d_H$ or CD, between each of the $q(q - 1)/2$ pairs of rank orders.
   c) Calculate performance (using $d_H$) and diversity strength (using either $d_H$ or CD) of each of the $q$ rank orders.
   d) Use the sliding rule to pick the top $t$ rank orders that have the highest performance and diversity strength.
3) Go to Step 1.
4) $l$ is the number of iterations of the expansion process in Step 1.
The CLEAR($M$; $D$; $n$, $t$, $l$) framework works as follows. Starting with $t$ rank orders (with ties allowed) in $H_n$, we begin with expansion, generating $e(t) = 2 \times (2^t - 1 - t) + 1 = 2^{t+1} - 2t - 1$ rank orders by three combination methods:
1) $2^t - 1 - t$ weighted combinations using performance as weights [63];
2) $2^t - 1 - t$ combinations by geometric mean [81];
3) combination by mixed group rank [41].
The second step in the CLEAR framework, reduction, first selects the top $q$ ($\sim 2t$) performing subset from the $e(t)$ rank orders, computes the "diversity" between each of the $q(q - 1)/2$ pairs of these rank orders, and calculates the "diversity strength" for each of the $q$ rank orders as the average of the CD between this rank order and the other $q - 1$ rank orders.
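The expansion step described above can be sketched in Python as follows. This is a simplified illustration, not the authors' implementation: the weights are supplied by the caller (the exact weighting by performance is not specified here), combined scores are converted back to rank orders with ties by a hypothetical helper `to_ranks`, and the mixed group rank combination [41] is omitted, so the sketch produces $2(2^t - t - 1) = e(t) - 1$ rank orders.

```python
from itertools import combinations
from math import prod


def to_ranks(scores):
    """Convert combined scores (lower is better) to ranks; equal scores share a rank."""
    rounded = [round(s, 9) for s in scores]  # merge floating-point noise into ties
    rank_of = {s: k + 1 for k, s in enumerate(sorted(set(rounded)))}
    return tuple(rank_of[s] for s in rounded)


def expand(rank_orders, weights):
    """Combine every subset of two or more rank orders in two ways (assumed forms)."""
    t, n = len(rank_orders), len(rank_orders[0])
    expanded = []
    for size in range(2, t + 1):
        for subset in combinations(range(t), size):
            w_total = sum(weights[k] for k in subset)
            weighted = [
                sum(weights[k] * rank_orders[k][i] for k in subset) / w_total
                for i in range(n)
            ]
            geometric = [
                prod(rank_orders[k][i] for k in subset) ** (1 / size)
                for i in range(n)
            ]
            expanded.append(to_ranks(weighted))
            expanded.append(to_ranks(geometric))
    return expanded


def expansion_size(t):
    """e(t) = 2 * (2^t - 1 - t) + 1, including the mixed group rank combination."""
    return 2 * (2 ** t - 1 - t) + 1


orders = [(1, 2, 3), (2, 1, 3), (1, 3, 2)]
new_orders = expand(orders, weights=[1.0, 1.0, 1.0])
print(len(new_orders), expansion_size(len(orders)))
print(new_orders[0])  # combining (1, 2, 3) and (2, 1, 3) ties the first two items
```

The first combination shows why the Kemeny rank space is needed: averaging two permutations readily produces a rank order with ties.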
Step 2 produces the top $t$ rank orders from the $q$ rank orders using the "sliding rule" upon the joint performance and diversity strength. The reinforcement ER-algorithm continues as a multi-layer combinatorial fusion process for $l$ iterations (layers) until one of the following stopping rules is met: 1) none of the top $q \sim 2t$ rank orders in the expanded set of rank orders is better than the best of the initial $t$ rank orders, or 2) the expanded set of $e(t)$ rank orders converges to fewer than $t$ rank orders.
Diversity between a pair of two rank orders A and B, $d(A, B)$, can be calculated either by the distance between A and B on the Kemeny rank space, $d(A, B) = d_H(A, B)$, or by the CD between the RSC functions, $d(A, B) = d(f_A, f_B)$. For the set of top $q$ rank orders from the $e(t)$ expanded rank orders, diversity for each of the $q(q - 1)/2$ pairs of rank orders can be obtained. Diversity strength of each of the $q$ rank orders, $ds(A)$, is calculated as the number of pairs in which A appears among the relatively top pairs with high diversity (meaning those before a sharp drop, or scree). The $q$ rank orders can be listed in two columns using performance and diversity strength in descending order. A sliding rule is placed on each row until $t$ rank orders are obtained. In the ER-algorithm process, performance is meant to be how close the rank order is to the identity permutation, as measured by the $d_H$ distance.
In case of a tie, the median rank is assigned to all three rank positions. This reinforcement learning algorithm is similar to evolutionary computational algorithms. An initial population of scoring functions is generated, upon which the predictive performance is then evaluated. The top-performing rank orders are selected.
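The reduction step can be sketched in Python under several assumptions. Diversity strength is taken here as the average pairwise distance to the other $q - 1$ rank orders, the pairwise distance is a Kemeny-Snell style half-flip count standing in for $d_H$, and the sliding rule is interpreted as walking down the two sorted columns row by row until $t$ distinct rank orders are collected. The function names and the example data are hypothetical.

```python
def half_flip_distance(ranks_a, ranks_b):
    """Kemeny-Snell style distance on rank orders with ties (stand-in for d_H)."""
    n = len(ranks_a)

    def sign(r, i, j):
        return (r[i] < r[j]) - (r[i] > r[j])

    total = sum(
        abs(sign(ranks_a, i, j) - sign(ranks_b, i, j))
        for i in range(n)
        for j in range(n)
    )
    return total // 2


def diversity_strength(orders):
    """Average distance from each rank order to the other q - 1 rank orders."""
    q = len(orders)
    return [
        sum(half_flip_distance(orders[i], orders[j]) for j in range(q) if j != i) / (q - 1)
        for i in range(q)
    ]


def sliding_rule(performance, strength, t):
    """Walk the two sorted columns row by row until t distinct indices are chosen."""
    q = len(performance)
    by_performance = sorted(range(q), key=lambda i: performance[i])  # closest to target first
    by_diversity = sorted(range(q), key=lambda i: -strength[i])      # most diverse first
    chosen = []
    for row in range(q):
        for idx in (by_performance[row], by_diversity[row]):
            if idx not in chosen:
                chosen.append(idx)
            if len(chosen) == t:
                return chosen
    return chosen


target = (1, 2, 3)  # the identity permutation
orders = [(1, 2, 3), (1, 1, 3), (2, 1, 3), (3, 2, 1)]
performance = [half_flip_distance(o, target) for o in orders]
strength = diversity_strength(orders)
print(performance, strength, sliding_rule(performance, strength, t=2))
```

In this toy example the rule retains one rank order for its closeness to the target and one for its diversity, which is the balance the reduction step aims for.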
Optimisation under any complete metric sub-space enables linear transitive (or sub-additive) convexity to hold. This allows for convex optimisation upon the entire space to be established for any well-posed problem space with a unique solution, with all requisite maximum likelihood properties (e.g., minimum error and maximum information; [14]). As long as the variance-covariance matrix (reflective of the statistical correlational matrix) is positive-definite, it follows that the optimal combination of the learning systems, as proposed in this system, is expected to be linearly convex and well-posed, and is therefore resistant to over-fitting in the mathematical sense, regardless of the parametric realisations of the score distributions themselves in the Euclidean subspace.

D. SIMULATED AND EMPIRICAL EXAMPLES
An exploration of the CLEAR model was reported in Zhong et al. (2019) [43]. It presented two simulations of MCF in $H_{10}$ and $H_{300}$ and an empirical application built upon a protein-ligand virtual screening study using deep-learning neural networks [12] to demonstrate the conceptual framework developed and expanded in the CLEAR framework. These two simulated cases $H_{10}$ and $H_{300}$ and the empirical example in virtual screening and drug discovery are included in Tables 10, 11, and 12 [43]. Specifically, we demonstrated how the recursive ultrametric structure of the graph depicted in Figure 4 provides a convex and relatively quickly solved framework, which yields a marked improvement in performance that surpasses the initial input models. Here, Index denotes the 300 unique identifiers corresponding to the neural-network systems provided by [12], and Input denotes the system error for each neural-network system, or their aggregate performance under linear combination. These conventions hold for Tables 10, 11, and 12, with each numerical value denoting the total error variance, and with improved system performance reflected as the value approaches 0 from the right.
For both $H_{10}$ and $H_{300}$, 125 rank orders upon the bubble-sort Cayley graph were selected with rank distance $d_H$ greater than 40% and less than 50%, upon which 150 cases were selected. These performance bounds were selected to ensure that the worst-case performance is explored, bounded by error rates of between 40% and 50%. They allow us to characterise the worst-case performance improvement observed within our fusion framework, relative to $I_n$. The empirical example utilised the 12 top-performing functions reported from the 100 in [12], which are all no less than 87% accurate in performance, as they are the results of a deep learning computing experiment.
In the rank space $H_{10}$, the 150 rank orders were partitioned into 28 groups of 6 models each, with few duplications of the models across all 28 groups. The initial distances of the input models were found to be in [44.4%, 48.9%], while the combined results were in [8.89%, 42.22%], achieved in between 1 and 5 layers with computational time for all cases (in seconds) in [1.47, 7.68]. These results for $H_{10}$ are found in Table 10. Upon the rank space $H_{300}$, the 150 rank orders were composed into 50 groups of 6 each, with some duplication. From this set of 50 groups, 5 were selected as given in Table 11, for which the distances of the initial 6 were within [49.949%, 49.996%], resulting in an output distance in [46.253%, 48.495%] for either two or three layers with computational time (in minutes) in [20.29, 31.64]. The empirical case containing the top 12 performant models was found to possess a distance $d_H$ from the target in [12.6466%, 13.4448%], from which 5 groups of 6 models were chosen to reflect the best 6, the worst 6, and a random subset of six. These produced combination results with distance to the target $d_H$ in [11.766%, 12.022%], requiring only two or three layers and converging (in minutes) in [39.64, 47.74], as reported in Table 12.
It is important to note from these three examples that even though the decrease in performance percentage from 40.01% in $H_{10}$ to 3.74% in $H_{300}$ is substantial, the number of improved steps is quite high, as the number of nodes in $H_{300}$ is much larger than that in $H_{10}$. On the other hand, we also note that systems which have better performance converge to the target much faster because they possess less cumulative redundancy in the convex cone which represents the performance for each approximation. As the performance increases, the relative diversity between said functions must necessarily decrease, thereby limiting the available uniqueness that results from the combinations.

VII. CONCLUDING REMARKS
Combinatorial fusion analysis (CFA) [18] uses the rank-score characteristic (RSC) function [1] and cognitive diversity (CD) [1], [18], [67] to combine MSS with both score combinations (in Euclidean space) and rank combinations (in the bubble-sort Cayley graph based on the symmetric group $S_n$). By extending $B_n$ to the Kemeny rank space $H_n$, which allows tied rankings, and using the computationally efficient $\tau_x$ distance metric [39], [40], [78], [79], multi-layer combinatorial fusion (MCF) provides a robust computational and combinatorial framework on the metric space $H_n$ using RSC-function-based cognitive diversity. Section III contrasts the RSC function in data science and informatics with the empirical CDF in statistics. Section IV compares CD in data science and informatics with other data correlations in statistics and information diversity in computation. Section V reviews the field of CFA and illustrates two intelligent biometric systems among a variety of domain applications. Section VI provides the multi-layer combinatorial fusion (MCF) framework CLEAR($M$; $D$; $n$, $t$, $l$) on $H_n$ with two simulated examples on $H_{10}$ and $H_{300}$ and an empirical example on protein-ligand virtual screening and drug discovery in $H_{300}$. In the following, we summarise three distinctive characteristics of the MCF approach (A) and suggest several directions for future work (B).

A. MCF ON THE KEMENY RANK SPACE
Multi-layer combinatorial fusion (MCF) using cognitive diversity has the following distinct features worthy of special attention:
1) MCF uses combinatorial fusion analysis (CFA) [1], [18], [43], which has the following characteristics:
   a) It considers a scoring system A as both a score function $s_A$ in the Euclidean space $\mathbb{R}^n$ and the derived rank function $r_A$ in the bubble-sort Cayley graph space $B_n$. The RSC function was defined to characterise the scoring system A.
   b) It combines scoring systems A and B in both score and rank combinations. It was shown that under certain conditions (involving cognitive diversity) rank combination can perform better than score combination [38].
   c) Cognitive diversity between scoring systems A and B is defined using RSC functions $f_A$ and $f_B$. Empirical results on consensus scoring were very useful for improving performance in virtual screening and drug discovery [8].
   d) CFA provides combinatorial fusion ($2^t - t - 1$ combinations for $t$ original scoring systems) for any combination method, which is efficient and effective in different domain applications, including protein-structure prediction [13], stress identification [20], text categorisation [21], target tracking, robot mapping, and localisation [24], [25], identification of degenerated motifs [26], ChIP-Seq peak detection [27], combining visual cognitive systems [28], [29], preference detection using eye movement [30], virtual screening [8], microarray gene expression [32]-[34], portfolio management [35], [36], classifier ensembles [63], [64], and online learning [71].
2) MCF is based on the Kemeny rank space $H_n$, which has the following characteristics:
   a) It is a natural extension of the symmetric group $S_n$ and the bubble-sort Cayley graph $B_n$ [43], [73].
   b) The nodes in $H_n$ include rank orders with ties. This property facilitates advances in the MCF process when ties are the result of combinations.
   c) It provides a distance metric $d_H$ which uses the computationally efficient $\tau_x$ function.
   d) The structure of a network can dictate and affect the function and application of the network. $H_n$ has $S_n$ as its sub-network, and $B_n$ as its functional subspace. The structure of $S_n$ has been well studied [82]-[84]. The bubble-sort Cayley graph $B_n$ has many combinatorial properties, including combinations of mutually independent Hamiltonian cycles [84]. The symmetric group $S_n$ on $n$ elements has many useful combinatorial structures. For example, the connectivity of $S_3$ is 3 and there are 3 disjoint paths between any pair of nodes in the graph, as shown in Figure 4. See [83] for a general study of container width and length in a variety of graphs and groups, where a container between two nodes A and B is a set of disjoint paths between A and B.
3) The expansion property of $B_n$:
   a) Since the Kemeny rank space $H_n$ is a metric space which embeds the bubble-sort Cayley graph $B_n$ as a subspace, the structure and properties of the graph $B_n$ can facilitate the ER-algorithm in the CLEAR($M$; $D$; $n$, $t$, $l$) framework. In fact, the graph $B_n$ has some good expansion properties (see [85] for expander graphs). For example, it was shown [86] that the second eigenvalue of $B_n$, $\lambda_2(B_n)$, is at most 1, which is related to certain expansion coefficients.

B. FUTURE WORK
A direction of our future work will focus on applying the CLEAR framework to a variety of domain applications on the Kemeny rank space $H_n$ with distance metric $d_H$. In this regard, Heiser and D'Ambrosio obtained results on clustering and prediction of ranking using $d_H$ [87]. Beyond the improvement in the simulated cases of $H_{10}$ and $H_{300}$ (Tables 10 and 11), Table 12 exhibits a successful case, in the empirical domain of virtual screening and drug discovery, of multi-layer combinatorial fusion (MCF) and deep reinforcement learning [12], [43].
Other than the work on various domain applications, we will also work on problems which require multidisciplinary approaches to fundamental methods and intelligent systems, including intelligent biometric systems and computational intelligent systems, blending together the three foundational fields for data science and informatics: statistics, mathematics (including combinatorics and graph theory), as well as computing and informatics (including machine learning and AI) [15], [18], [22], [40], [43], [50], [63], [65], [67], [70], [73], [76], [78], [86]. Recent results by experts from diverse fields have made significant contributions, for example, in the following areas: multi-modal biometric systems using rank level fusion for security systems [5], [6], fusion of deep learning and combinatorics [88], emphasising the fusion of computer hardware and software, global network architecture and web systems, proactive and reactive investigation, and public-private collaboration in mitigating cyber attacks and cyber exploitations [89], calculating the thermodynamic limit for the Mallows model on the bubble-sort Cayley graph $B_n$ [90], harnessing fuzzy logic and combinatorial fusion to make network selection simple and effective in the heterogeneous mobile user environment [91], and using model fusion algorithms for neural networks with optimal transport [92]. Equipped with a robust framework such as CLEAR, using multi-layer combinatorial fusion (MCF) and cognitive diversity (CD), on a fundamental metric space such as the Kemeny rank space $H_n$ using the distance metric $d_H$ [1], [18], [38], [40], [67], [78], [82], more exciting results will be forthcoming to the benefit of secure, healthy, and sustainable societies.