Detecting Associations Based on the Multi-Variable Maximum Information Coefficient

The maximum information coefficient (MIC) is a novel and widely-using measure of association detection in large datasets. The most outstanding feature of MIC is that it has both generality and equability. However, MIC can only deal with two variables and cannot precisely estimate coupling associations of multiple variables. In this paper, we propose an extension of MIC to deal with multi-variable datasets, called the multi-variable maximum information coefficient (MMIC). Some inherited and novel properties of MMIC are proved, including generality, equability, monotonicity, and subadditivity. We design an algorithm based on greedy stepwise strategy and upper confidence bound (UCB) for an approximate calculation of MMIC. The tests of MMIC on generated datasets and examples on real datasets are carried out to detect known and novel associations.


I. INTRODUCTION
In recent decades, data science has been rapidly developed. Large datasets which hold a lot of information have driven a novel idea ''follow the data'' [1]. Association detection is a classic, common, meaningful but still challenging work in the field of data analysis. It is not feasible or profitable to detect associations manually in many large and complex datasets. A relatively reasonable measure is needed to evaluate whether the variables are associated or not, and to what extent they are associated. In many cases, we do not know the type of associations contained in the datasets. The associations may be non-linear, non-monotonic, and cannot be expressed by a mathematical function.
In order to identify these complex associations, the maximum information coefficient (MIC) is proposed [2]. MIC is a measure of association based on mutual information entropy which has both generality and equability. Generality guarantees that MIC can capture varied types of associations. Equability guarantees that MIC can give similar scores to different types of associations with equal noise [3]- [6].
A large number of researchers have done a lot of research on MIC. The research can be classified in three categories: theoretical verification, calculation, and application.
The associate editor coordinating the review of this manuscript and approving it for publication was Paul D. Yoo .
In the research of theoretical verification, MIC is proved that it has both generality and equality [3]- [6]. Moreover, MIC is confirmed to Rényi's axioms [7] and three new axioms when variables are continuous [8].
In the research of calculation, the main efforts are concentrated on improving the calculation accuracy and the reducing the time complexity. The intelligent MIC (iMIC) [9] is proposed for optimizing the partition on the y-axis to get approximate scores of MIC with acceptable accuracy. SuperMIC [10] uses MapReduce framework to improve the calculating efficiency of MIC. The improved algorithm for approximation of MIC (IAMIC) [11], [12] improves the accuracy of MIC by searching more optimal partitions on the y-axis than equipartition. The most commonly used toolkit for MIC is the Maximal Information-based Nonparametric Exploration (MINE) [2] which is compatible with C++, Python, and R. There are also many data independence testing tools based on MIC, including testforDEP [13] for R, MICtools [14] for Python. RapidMic [15] is a cross-platform tool for the rapid calculation of MIC based on parallel technology.
In practical applications, MIC has advantages dealing with complex associations which cannot be measured precisely by classic methods. MIC plays a role in association detection and data driven feature selection in many fields (see Table. 1), including computer science [16]- [19], biology [20]- [22], environmental science [23]- [25], medicine [26], [27], and genetics [28]. Datasets in these fields have a similar feature: there are some associations in the datasets, but the associations are difficult to be expressed by accurate models or mathematical formulas. In the fields of computer science, MIC is used for feature selection [16], [17], model evaluation [19] combined with machine learning methods. Applications in environmental science mainly focus on the analysis of hydrological activities [23], [24] and geology activities [25]. Biology mainly uses MIC on the analysis of neuroscience [20]- [22]. Medicine and genetics explore the associations of genes [28] and diseases [26], [27] based on MIC. MIC is mainly used in feature extraction [29], prediction [30], and estimation [31] based on energy datasets. MIC is also used in the fault diagnosis [32], [33] and prediction [34] of complex systems and equipment. Besides, MIC plays a role in engineering [35], astronomy [36], chemistry [37], sociology [38], optics [39], and agronomy [40]. Overall, MIC is an effective association detection method for feature selection [41], classification [18], and prediction [42].
Although MIC has made great achievements in theory, calculation, and application, there is still a fundamental problem worth expanding research: how to evaluate multi-variable associations based on MIC. The most obvious defect of MIC is that it can only deal with two variables. In many application scenarios, we not only want to detect two-variable associations, but also want to detect associations of multiple variables. Furthermore, multi-variable associations cannot be derived precisely by pairwise combinations of two-variable associations based on MIC. Therefore, an extension of MIC which enables it to deal with multi-variable associations is meaningful.
The contributions of this paper can be summarized as follows: • We design a measure based on MIC, and name it as the multi-variable maximum information coefficient (MMIC). Then we prove generality, equability, and several intuitive multi-variable properties of MMIC.
• We propose a calculation algorithm of MMIC based on greedy stepwise strategy and upper confidence bound (UCB) [43], and analyze the time complexity of the algorithm. In order to verify the accuracy and the performance of the algorithm, we test MMIC on a series of generated datasets with different types of association, relative noises, and dimensions.
• We apply MMIC on real datasets to verify the feasibility and effectiveness of MMIC.

II. RELATED WORKS
Based on MIC, there has been some similar work of multi-variable extension, including the three-variable maximum information coefficient (3MIC) [44] and the bisecting k-means clustering maximum information coefficient (BKM-MIC) [45]. We will brief the ideas of them and analyze their advantages and disadvantages.

A. THE MAXIMUM INFORMATION COEFFICIENT
For a finite dataset D ⊂ R 2 containing two variables, x and y partitioned into dx and dy blocks. If the grid size dx and dy are fixed, we can get the maximum mutual information entropy of the dataset D with a fixed grid G (denoted as I (D| G )) [2].
The mutual information entropy for each grid size is normalized to the range of [0, 1] and put into the counting matrix M .
Based on the traversal of all feasible grid sizes dx and dy, we can get the maximum element in M as MIC. There is an upper limit of grid size for preventing overfitting which is denoted by B (n). The data volume is denoted by n.

B. THE THREE VARIABLE MAXIMUM INFORMATION COEFFICIENT
3MIC adopts a different calculating method of information coefficient. For a finite dataset D ⊂ R 3 with data volume n, each element of counting matrix M with grid size dx, dy, dz is calculated based on mutual and conditional information entropy.
Except the calculation of I (D, dx, dy, dz), the rest of 3MIC's definition and calculation are the same as MIC.
The advantage of 3MIC is that it can be calculated under a relatively low time complexity. However, 3MIC has some defects of mathematical properties.
• 3MIC does not conform to Rényi's axioms [7]. The scores of 3MIC are not always in [0,1]. For example, VOLUME 9, 2021 when x = y, x and z are independent at the meanwhile, I (x; y; z) ≈ I (z) − I (x) − I (y). In this case, 3MIC can be negative.
• 3MIC is not an accurate multi-variable MIC, but a roughly estimating algorithm based on MIC of each two variables. Besides, 3MIC is not proved to have generality and equality.

C. THE BISECTING K-MEANS MAXIMUM INFORMATION COEFFICIENT
BKM-MIC is a multi-variable MIC calculating method based on a kind of dimensional greedy stepwise strategy. The core idea of BKM-MIC can be concluded as: (1) In the equations, dX is a fixed value decided by G X .
BKM-MIC is compatible with MIC and ranges from 0 to 1. However, BKM-MIC still has some flaws to be improved: • The newly added variables cannot directly affect the already fixed grid. The coupling associations may not be fully detected.
• Premature fixation of grid size makes it difficult to capture data feature adequately.

III. THE MULTI-VARIABLE MAXIMUM INFORMATION COEFFICIENT
To guarantee the integrity and rigor of MIC for multiple variables and improve the rationality and feasibility, we propose MMIC. In this section, we will discuss MMIC from three aspects: definition, properties, and calculation.

A. DEFINITION OF THE MULTI-VARIABLE MAXIMUM INFORMATION COEFFICIENT
The definition of MMIC can be extended from MIC fundamentally. Given a finite dataset D ⊂ R m , variables in D are divided into two groups, X and Y. MMIC is the information coefficient of the ''best'' grid. The information coefficient is the normalized mutual information entropy decided by the count of samples in each box. The ''best'' means that the grid is corresponding to the maximum normalized mutual information entropy. The number of columns in X and Y can be arbitrary, denoted as m x and m y , m = m x + m y . When m x = m y = 1, MMIC can give the same score as MIC to keep compatibility with MIC. The i-th X or Y variable is partitioned to dx i or dy i parts, and the size of grid G is denoted as dG = The definition of MMIC can be summarized as follows.
The mutual information entropy I * (D, dX , dY ) is normalized and stored in counting matrix M .
For example, MIC can be regarded as the ''best'' grid of two-dimensional space like Fig. 1, and MMIC of three variables can be regarded the ''best'' grid of three-dimensional space like Fig. 2.

B. PROPERTIES OF THE MULTI-VARIABLE MAXIMUM INFORMATION COEFFICIENT 1) PROPERTIES INHERITED FROM THE MAXIMUM INFORMATION COEFFICIENT
MMIC reserves all good mathematical properties of MIC, including symmetry, normalized range, generality, and equability.
Since MMIC is based on mutual information, the symmetry of mutual information entropy guarantees the symmetry of MMIC. Therefore, the score of MMIC is not affected by the order of variables.  MMIC tends to be 1 when at least one variable y in Y keeps a noiseless association of X, and at least one variable x in X keeps a noiseless association of Y. MMIC tends to 0 when each variable y in Y and each variable x in X are independent with sufficient samples. The premise, sufficient samples, is set to prevent random associations due to the effect of small sample. This premise is more prominent when the dimension of data is higher. Higher dimension brings more boxes partitioned by the grid and fewer points in each box. Therefore, the data samples are diluted by dimension, and MMIC is more likely to capture some random associations.
• Generality and equability. We will prove that MMIC can give similar scores for different types of association with equal noises by experimental verification. Besides, we will discuss the relationship between MMIC and R 2 . MMIC is roughly equal to R 2 when the associations can be expressed by mathematical functions (see experiments and results in Section IV).

2) PROPERTIES FOR MULTIPLE VARIABLES
Besides the mathematical properties inherited from MIC listed above, MMIC also has some intuitive properties for multiple variables, including monotonicity and subadditivity. We will prove these two properties as follows.
Proof: More information usually leads to better choices. Monotonicity can be deducted based on the definition of MMIC.
Denote the partitioning grid of MMIC (X 1 , Y) by G 1 = G X 1 , G Y . G X 1 is the grid of X 1 , and G Y is the grid of Y.
Presume that X 2 is not partitioned. Therefore, G X 2 is empty. Y), the scores of all other grids are smaller than it, including the grid G 1 . Therefore, Proof: MMIC can be regarded as an information extracting method based on mutual information entropy. Mutual information entropy conforms to the Jensen's Inequation Y). Therefore, MMIC keeps subadditivity of mutual information entropy. The equal sign is satisfied only if X 1 and X 2 are dependent.

C. CALCULATION OF MULTI-VARIABLE MAXIMUM INFORMATION COEFFICIENT
The most challenging work of MMIC is not the theoretical derivation, but the calculation. The calculation of MIC is time-consuming when dealing with complex associations and large amounts of data samples. Complex associations lead to more different partitioning grids which needed to be calculated and compared. Large amounts of data samples lead to more time cost of information entropy calculation of each grid. MMIC introduces the third factor of time complexity: the dimension. Dimensional explosion heavily exacerbates the disaster of time complexity.

1) CALCULATION STRATEGIES
Almost all calculating methods of MIC are approximate algorithms balancing accuracy and time complexity. Like MIC, calculating the exact value of MMIC is not feasible in most cases. In order to weigh accuracy and time complexity, we design and implement three calculation strategies: limitation of grid, greedy stepwise strategy and upper confidence bound (UCB).

a: LIMITATION OF Grid
The first strategy is setting reasonable limitations of the maximum grid size and the potential partitioning positions. The maximum grid size can be interpreted as the model capacity of association. In theory, MMIC can detect associations of any complexity. However, in practice, it can only accurately measure associations below the model capacity. This limitation is not only designed to reduce the time complexity of calculation, but also designed to prevent recognizing associations caused by coincidence of small sample. Different from the limitation of the maximum grid size, the limitation of potential partitioning positions is merely intended for simplicity of calculation. In tests, we find that the impact on calculation accuracy caused by the limitation of the grid can be ignored, and most of the interesting and common associations can be measured with tolerable error.

b: GREEDY STEPWISE STRATEGY
The second strategy is a kind of greedy stepwise strategy including two steps: initialization and optimization. In initialization, a grid is generated based on an initial strategy. In most VOLUME 9, 2021 cases, grids based on uniform partition and two-variable MIC can be expedient choices. After obtaining the initial grid, the grid is adjusted to get the mutual information entropy maximum by a kind of neighborhood searching strategy: finding the maximum positive gain I (D|G * ) − I (D|G) and replacing G with G * .

c: UPPER CONFIDENCE BOUND
The third strategy is UCB. UCB is a searching strategy which can balance exploration and exploitation [43]. The calculation of MMIC is a classic problem of solution searching: selecting optimal combinations from candidate items under certain conditions. In the calculation of MMIC, we not only need to select the partitioning position, but also need to select the partitioning variable. Therefore, the selection based on UCB maintains an average gain for each variable as: In variable selection, the average gainμ i and the explorations times T i are weighted by UCB.
According to our test, η = 0.1 is appropriate in most cases. For the calculation of MMIC, UCB has two advantages: • UCB can reduce the randomness of the calculation, and enhance the robustness in the grid searching.
• UCB can not only give the value, but also give the confidence.

2) ALGORITHM AND TIME COMPLEXITY
Based on the strategies listed above, to balance calculation accuracy and time complexity of MMIC, there are four adjustable parameters according to data dimension and volume in Algorithm. 1: α, β, γ and θ. α is a parameter inherited from MIC [2] to control the maximum grid size. β is designed for controlling the potential partitioning position. Besides, γ and θ are designed for controlling the number of iterations corresponding to max_iter(global) and max_iter(local) in pseudo code respectively. To get rid of the influence of data type and programming language, we define that the complexity of calculating information coefficient of a fixed grid is 1. When the amount of data samples is n and the dimension of data samples is m, based on the maximum grid size [2] and the potential partitioning position, the number of feasible grid size is approximately m α · 2 n . The amounts of searching space for each global iteration and local iteration are βlog 2 n and θ γ m respectively. Therefore, for each feasible grid size, there are about βγ θmlog 2 n grids being compared. Overall, the theoretical time complexity of Algorithm 1 is approximately O m α · 2 n · βγ θmlog 2 n (see Fig. 3).

Algorithm 1 Calculation of MMIC Based on Greedy
Stepwise Strategy and UCB Input: dataset D, dividing to variables X and Y Output: MMIC, best grid for feasible grid sizes dG do Initialize uniform grid G with equal density for iter(global) from 1 to max_iter(global) do Select variable V i from X and Y by UCB (see (14) and (15)) for iter(local) from 1 to max_iter(local) do for all G from G and neighborhood(G) do G * ← arg max

A. TEST ON GENERATED DATASETS
To prove that MMIC is as effective as MIC, we generate datasets based on combining function with different dimensions, data volumes, and relative noises. The datasets contain non-linear, non-monotonic associations (see Table. 2). All the programs are running on a laptop which contains an Intel Core i7-8750H CPU and 16GB memory. The algorithm is coded in pure Python and posted on https://github.com/GuTaoyong/The-Multi-Variable-Maximum-Information-Coefficient. The default values of controlling parameters are α = 0.6, β = 2, γ = 1, and θ = 1.  In the tests, the calculation accuracy of MMIC is satisfactory. The scores of MMIC of noiseless functions are near 1, very close to that of MIC. The worst result is 0.92 which is acceptable with the limited computational resources (see Table. 3). More detailed results can be seen in Table. 7 and 8 in Section V. The association detection ability of MMIC can be described as equivalent to MIC.
In terms of calculating efficiency, when the data volume is 200, MMIC of two-variable, three-variable, and four-variable data can be calculated in less than 20 seconds, one minute, and five minutes respectively. It is proved that even on a lightweight computing device, the calculation of MMIC is still feasible.
In addition, we compare the results of MMIC with the goodness of fit (R 2 ). Some conclusion can be drawn from the tests: • MMIC is roughly equal to R 2 , and the changing trends of MMIC and R 2 are consistent (see Fig. 4). This means that MMIC can do roughly well as R 2 without the preset of association types.
• In most two-variable cases, MMIC is less than R 2 . It is consistent with the conclusion that R 2 is the approximate theoretical upper bound of MIC [2]. However, in some three and four-variable, or highly noisy two-variable cases, this relationship may change (see Fig. 4). This is an overfitting and underfitting dilemma in nonparametric association detection, including MIC and MMIC. Unlike R 2 , we do not preset the type of associations in the calculation of MMIC. Therefore, MMIC may detect random associations when the grids are too dense, and miss associations when the grids are too sparse.
These two conclusions of MMIC are also consistent with MIC. Therefore, the inheritance between MMIC and MIC is further verified. More detailed experimental results are presented in section V.

B. EXAMPLES ON REAL DATASETS
Besides the tests on generating datasets, we also use MMIC to detect known and unknown associations based on real datasets, including datasets from the World Health Organization (WHO) [46] and the National Climatic Data Center (NCDC) [47].

1) EXAMPLES ON WHO DATASET
The WHO dataset contains 355 indicators of 202 countries and regions. The purpose of the examples on the WHO dataset is to get the most relevant indicator sets of a certain indicator. Besides the scores of MMIC, we can also get the partitioning grids of MMIC which may have some reference value.
Population growth rate is an issue of widespread concern and frequent discussion. Therefore, we use MMIC to analyze the associations of population growth rate and other indicators based on the dataset.
Firstly, we calculate MMIC of each two variables to determine the range of MMIC and roughly judge whether the variables are associated or not. Population annual growth rate is selected as Y. According to the monotonicity and subadditivity of MMIC, variables with larger MMIC are given preference in the selection of X. The calculation of two-variable MMIC shows that, besides the other population indicators (population growth and urban population growth), the most relevant indicators are population living below the poverty line, children per woman, contraceptive prevalence, and under-5 mortality rate (see Table. 4). VOLUME 9, 2021    Then we select X from these four variables in pairs, and set population annual growth rate as Y to calculate MMIC. The results are shown in Table. 5. Compared with R 2 , Pearson, Spearman, and Kendall correlation coefficient of linear regression, MMIC can give higher scores for non-linear and non-monotonic associations.
For example, the most relevant two-variable combination is population living below the poverty line and children per woman. In the grid of MMIC of these indicators, the partitioning points are 3.7, 10.5, 14.8, 36.1, and 57.8 for population living below the poverty, 1.94, 2.31, and 3.08 for children per woman, 1.7 for population annual growth rate. As shown in Fig. 5 and 6, the partitioning points of the grid are located in

2) EXAMPLES ON METEOROLOGICAL DATASETS
The meteorological datasets are collected from 18 observation points. Our analysis focuses on the associations of temperature, wind, sky coverage, and liquid precipitation. The purpose of the examples of MMIC on meteorological datasets is to evaluate the strength and non-linearity of the associations of the indicators.
We use MMIC − R 2 to evaluate the non-linearity of associations. R 2 denotes the goodness of fit of linear regression. This measure corresponds to the non-linearity measure of MIC: MIC − ρ 2 , where ρ denotes the Pearson correlation coefficient [2]. MMIC − R 2 is near 0 for linear associations and large for non-linear associations with high scores of MMIC. After examining all associations, we find there are 26.2% (88/335) of them are non-linear (MMIC − R 2 > 0.2).
We select three typical points in Fig. 7 for illustration. Point A, B, and C represent strong linear (dew point temperature, wind direction, and air temperature), non-linear (sky condition, air temperature, and liquid precipitation), and weak non-linear association (sea level pressure, air temperature, and liquid precipitation) respectively (see Fig. 8).

V. CONCLUSION
In this paper, we propose a multi-variable extension of the maximum information coefficient, and name it as the multi-variable maximum information coefficient (MMIC). We prove that MMIC inherits all good properties of MIC including generality and equability. Besides, MMIC is also proven to have some intuitive properties of multiple variables like monotonicity and subadditivity. For the calculation of MMIC, we design an algorithm based on greedy stepwise strategy and UCB and analyze the time complexity of the algorithm. Based on generated and real datasets, we illustrate the rationality and feasibility of MMIC. MMIC can detect non-linear and non-monotonic associations of multiple variables which may be helpful for large-scale association analysis.
MMIC has some room for further research. For example, distinguishing the coincidence and real associations and mining causality through associations by MMIC are two meaningful tasks. Some heuristic searching and reinforce learning strategies can be introduced to improve the efficiency of calculation by pre-evaluating the grids. A combination of heuristic high-layer strategy and greedy or brute force lower-layer strategy may be a good choice for the calculation of MMIC.

APPENDIX
In this section, the detailed results are demonstrated, including the time complexity of Algorithm. 1 in Table. 6, the relation of relative noise, MMIC, and R 2 in Table. 7 and  Table. 8, and the most relevant indicators of population annual growth rate in WHO dataset evaluated by two-variable MMIC in Table. 9.