NNR-GL: A Measure to Detect Co-Nonlinearity Based on Neural Network Regression Regularized by Group Lasso

To find keys to understanding and elucidating a phenomenon, it is essential to detect dependences among variables, and measures for this purpose have been proposed. The correlation coefficient and its variants are the most common, but they detect only a linear dependence (co-linearity) between two variables. Some recent measures can detect a nonlinear dependence (co-nonlinearity) by means of kernelization or segmentation. However, they handle only two variables and remain open to discussion with regard to detection performance and difficulty of setup. There is room for a novel measure based on Neural Networks (NNs), since usual NNs aim at prediction rather than at variable dependence detection. For high-performance detection of co-nonlinearities among multiple variables, we propose a measure called NNR-GL based on Neural Network Regression (NNR) regularized by Group Lasso (GL). NNR-GL embodies the detection through multi-input single-output regression by NNR and regularization on the input layer by GL. NNR-GL then calculates how strong the detected co-nonlinearities are by unifying the regression performance and the weights on the input variables. We conducted experiments using artificial data to examine the behavior and fundamental effectiveness of NNR-GL. The performance was estimated by a comprehensive detection performance criterion (CDP-AUC for short), which is the mean of the areas under curves representing true positive and true negative detections. NNR-GL achieved CDP-AUC values from 0.7472 to 0.9681, where 0 means complete failure and 1 means complete success in detection. These values were consistently higher than those of the conventional measures, which ranged from 0.5972 to 0.9259, for all the different conditions of dependence, data size, and noise rate. Consequently, the effectiveness and robustness of NNR-GL were clearly confirmed.


I. INTRODUCTION
In this study, we propose and evaluate a novel neural-network-based measure that detects nonlinear dependences among multiple variables.1 Two backgrounds motivated us: one is that of variable dependence detection, and the other is that of machine learning and knowledge discovery. Section I provides the first background, the second background, and the motivation and objective in Sections I-A, I-B, and I-C, respectively.

1 ''Dependence,'' not ''dependency,'' is used here, following the use of ''dependence'' as a technical term in the related papers. ''Dependences'' is used intentionally to mean multiple relationships among multiple variables.

A. BACKGROUND OF VARIABLE DEPENDENCE DETECTION
Detecting dependences among variables is a common issue in various fields as the first step toward understanding and elucidating phenomena. Measures for this purpose have been proposed and applied to scientific and engineering disciplines. We actually found a large number of such applications in a survey covering only the last five years, some of which concern the following: medicine on cancer, Alzheimer's disease, and Covid-19 [1]-[3]; brain science and engineering using electroencephalogram, electromyogram, and so on [4], [5]; genomics and proteomics on emergent properties, clonal fate, and protein types [6]-[8]; chemistry, physics, and material science on laser devices, ion energy, and photoelectrochemical power [9]-[11]; environmental and earth science on climate and ocean dynamics [12]-[14]; and automotive and transportation engineering including self-driving techniques [15]-[17]. Once a dependence is detected using a measure, it can become new knowledge on the focal phenomenon. It can also contribute to the selection of important variables and the improvement of regression/classification performance. The steps in variable dependence detection are illustrated in Fig. 1. In the first step, potential dependences that give hints for new hypotheses on a target phenomenon are detected among a wide variety of variables. In the second step, these dependences are narrowed down and brushed up into promising dependences based on knowledge specific to the domain. In the third step, the promising dependences are interpreted and formulated rigorously on the basis of domain knowledge. This result becomes newly established knowledge on the phenomenon. A traditional way starts from the second or third step for limited numbers and types of variables, assuming a wealth of domain knowledge. This way is important, of course, but it will not be sufficient for perceiving unexpected dependences behind various variables. We hence consider a measure for the first step to detect potential dependences that are difficult to perceive with known domain knowledge only.
As an example of the stepwise process, suppose clinical data accumulated in hospitals. Applying a measure in the first step makes it possible to detect potential dependences related to health risks from variables including patients' environments, lifestyles, clinical examinations, diagnoses, and treatments. That provides awareness of dependences hardly noticeable with known medical knowledge only. The potential dependences can be analyzed and formulated in the second and third steps to reach novel medical knowledge. Another example is car driving support, where sensor data on users, devices, and environments are available. In the first step, a measure detects potential dependences causing traffic accidents. Their analysis and formulation lead to new knowledge on safe driving in the second and third steps. The same applies to other fields.
Most of the conventional measures aim at the detection of a dependence between two variables. The correlation coefficient and its variants are the most common, but because they assume a linear dependence (co-linearity), they cannot detect a nonlinear dependence (co-nonlinearity) [18]-[21]. In recent years, measures able to detect a co-nonlinearity have been proposed. Some of them are based on mapping and correlation coefficient/covariance [22]-[26]. The others are based on segmentation and mutual information [27], [28]. Their abilities to express a co-nonlinearity and the difficulties in their setup are still controversial. Moreover, neither the linear nor the nonlinear measures can directly detect dependences among more than two variables.

B. BACKGROUND OF MACHINE LEARNING AND KNOWLEDGE DISCOVERY
Looking at Machine Learning (ML) and its use for finding new knowledge i.e. Knowledge Discovery (KD), they share technical features but pursue different interests as in Table 1. For the interest of ML, Neural Networks (NNs) including deep learning attract great attention and achieve high performances in nonlinear prediction. Such NNs implicitly model dependences among variables, but do not explicitly provide the modeled dependences. In terms of variable dependence detection, it would be worth utilizing NNs for the interest of KD from the standpoint of awareness.
As far as we surveyed, surprisingly, we did not find NNs that directly aim at assisting awareness, except for a few related works [29]. Specifically, NN-based measures representing how strong variable dependences are were not found, despite the high potential of NNs in nonlinear modeling. The main reason would be that NNs are a black box mixing input variables up in a way not understandable in the domain context (what is called ''implicit distributed representation''). In our past research collaboration with medical experts, they wanted to know clear relationships of the original variables, which were clinical test results having medical meanings. They did not want to get into the features transformed inside of NNs, which were difficult to understand medically. This episode is just our experience but suggests that focusing on original variables (equivalently, the input layer of a NN) is essential for awareness and understanding.

C. MOTIVATION AND OBJECTIVE
The background in Section I-A indicates the need for a novel measure to detect nonlinear dependences among multi variables in the first step in Fig. 1. That in Section I-B indicates the high but not demonstrated potential of NNs in variable dependence detection and the way to demonstrate it for the awareness in KD in Table 1.
These indications encourage us to propose a NN-based measure called NNR-GL, in which Neural Network Regression (NNR) [30], [31] models co-nonlinearities among multi-input single-output variables, and Group Lasso (GL) [32], [33] selects contributable input variables by regularization on the NNR's input layer. Thanks to GL on the input layer only, NNR-GL provides an explicit localized representation of which input and output variables are dependent and accepts any kind of NNR. As the measure value, NNR-GL outputs a quantity representing the strength of an input-output co-nonlinearity by unifying the performance of regression and the weight on each input variable. To analytically evaluate NNR-GL, we conduct experiments in comparison with the conventional measures using artificial datasets with known correct dependences. NNR-GL, a rough idea of which partially appeared in our past study [34], is now thoroughly proposed and evaluated in the present study.
The main contributions of this paper are summarized as the following #1, #2, and #3. #1: From a broad perspective, the detection of unexpected complex dependences behind various variables is necessary as the first step to clarifying phenomena. Our study, which lies in the interdiscipline between ML/KD and other sciences and engineering, provides people in these areas a novel way to achieve this first step. Furthermore, the study expands the utilization of NNs from prediction in place of humans to dependence detection for human awareness.
#2: As a concrete solution, our proposed measure NNR-GL makes the detection of multi nonlinear dependences possible in an accurate and robust manner. The ideas of NNR-GL (namely, nonlinear modeling by NNR, variable selection by GL, the way to quantify detected dependences, and robustness by averaging) are simple but applicable to various NNs.
#3: The experimental results in our study ensure the effectiveness of NNR-GL. They are helpful as the baseline performances of NNR-GL and the conventional measures when one wants to use these measures for his/her task. Unlike the past studies, our study introduces an evaluation methodology that is quantitative and both-sided for true positives and true negatives. This methodology can be used in future related work.
In this paper, Section I provides the background and objective as above. Section II reviews conventional measures for variable dependence detection and their remaining problems. Section III proposes the novel measure NNR-GL to detect nonlinear dependences among multi variables. Section IV designs an evaluation experiment to analytically examine the fundamental effectiveness of NNR-GL with artificial datasets. Section V reports the experiment and discusses the performances of NNR-GL and the conventional measures. As a stepping stone toward practicality, Section VI reports a pilot experiment to apply NNR-GL to real benchmark datasets. Finally, Section VII concludes the paper and gives some directions for future work.

II. CONVENTIONAL MEASURES
Our proposed measure NNR-GL is for the purpose of detecting nonlinear dependences among variables by introducing machine learning techniques NNR and GL. Thus, in Section II on related work, variable dependence is defined at first in Section II-A. Next, the conventional measures for variable dependence detection are reviewed in Section II-B. Finally, NNs, NNR, Lasso, GL, and their combinations are reviewed in Section II-C.

A. DEFINITION OF VARIABLE DEPENDENCE
In general, the independence between two random variables X and Y is defined as Equ. (1) using the joint probability mass or density function p_XY(x, y), the marginal probability mass or density function p_X(x) of X, and that of Y, p_Y(y) [35], [36]:

p_XY(x, y) = p_X(x) p_Y(y).   (1)

There is another definition, Equ. (2), based on the Fourier transform, i.e. the characteristic functions c_XY(s, t), c_X(s), and c_Y(t):

c_XY(s, t) = c_X(s) c_Y(t).   (2)

The greater the difference between the left and right sides of Equ. (1) or (2), the stronger the dependence.
The measures of a dependence between two variables can be categorized from three aspects. The first aspect is the definition of dependence: whether it is based on probabilities or on characteristic functions. The second aspect is the formulation of dependence: how the difference between the left and right sides is described mathematically, such as the norm of a subtraction or the logarithm of a ratio. The third aspect is the assumption of dependence: what shape is assumed for the dependence function, that is, linear or nonlinear. Generally, unlocking hidden complex dependences is emphasized, and so we discuss the conventional measures with respect to the third aspect.

B. CONVENTIONAL MEASURES FOR VARIABLE DEPENDENCE DETECTION
The conventional measures are the linear measure CC and its variants and the nonlinear measures DCC, HSIC, and MIC. Those based on NNs are not found so far, and thus we propose a NN-based measure later in Section III. CC, DCC, HSIC, and MIC are explained hereinafter in this Section II-B. Note that their definitions for the population X and Y are skipped; the definitions for the sample x and y are provided. In Section II-C, the existing NNs regularized by GL are discussed; they are not measures but are technically related to NNR-GL.

1) LINEAR MEASURES
CC is the most basic measure and assumes a linear dependence between two variables [18]-[21]. It estimates the difference between the left and right sides of Equ. (1) from the viewpoint of how the variables co-vary around a straight line. Equ. (3) formulates the definition of CC, where x and y are variables, x_i is the ith observation of x, x̄ is the sample mean of x, and the same applies to y_i and ȳ. N is the number of sample points, and COV_smp(x, y) is the sample covariance of x and y:

CC(x, y) = COV_smp(x, y) / sqrt(COV_smp(x, x) COV_smp(y, y)) = Σ_{i=1}^{N} (x_i − x̄)(y_i − ȳ) / sqrt(Σ_{i=1}^{N} (x_i − x̄)² Σ_{i=1}^{N} (y_i − ȳ)²).   (3)

CC ranges from −1 to +1, and a larger absolute value means a stronger dependence. CC and its variants can detect neither nonlinear dependences nor which variable causes which one.
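As a minimal reference for Equ. (3), the following NumPy sketch computes CC from two sample vectors; it is an illustration only (equivalent to np.corrcoef(x, y)[0, 1]), not the MATLAB routine used in the experiments reported later.

```python
import numpy as np

def correlation_coefficient(x: np.ndarray, y: np.ndarray) -> float:
    """Sample CC of Equ. (3)."""
    x_c = x - x.mean()                       # deviations from the sample mean of x
    y_c = y - y.mean()                       # deviations from the sample mean of y
    cov_smp = np.sum(x_c * y_c)              # unnormalized sample covariance
    return float(cov_smp / np.sqrt(np.sum(x_c ** 2) * np.sum(y_c ** 2)))
```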

2) NONLINEAR MEASURES BASED ON MAPPING
Distance Correlation Coefficient (DCC) is a measure based on characteristic functions [22], [23]. It calculates the L2-norm distance between the left and right sides of Equ. (2) for each of the pairs (x, y), (x, x), and (y, y). DCC then outputs the normalized distance of x and y as the quantity representing dependence. In Equ. (4), c^xy_N(s, t), c^x_N(s), and c^y_N(t) are the empirical characteristic functions, and DIST_smp(x, y) is the distance estimated with the N sample points. DCC ranges from 0 to +1, and a larger value means a stronger dependence.
It can detect co-nonlinearity with no assumption, but cannot explicitly specify the family of nonlinear functions.
Another mapping-based measure is the Hilbert-Schmidt Independence Criterion (HSIC) [24]-[26]. It detects a nonlinear dependence between two variables via kernelization. The difference between the left and right sides of Equ. (1) is estimated by the covariance in a reproducing kernel Hilbert space. HSIC is defined as Equ. (5), where φ(x, θ_φ) is the mapping function of x with its hyperparameter θ_φ, and ψ(y, θ_ψ) is that of y with its hyperparameter θ_ψ. The actual calculation is done using the inner products of the mapping functions, i.e. the kernel functions k(x, x′) and l(y, y′). The range of HSIC depends on the kernel functions and hyperparameters, but it can be normalized to −1 to +1. The larger the absolute value of HSIC, the stronger the dependence. Although HSIC can express various families of nonlinear functions, how to select the kernel functions and set their hyperparameters matters.
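As a concrete reference for the kernel-based estimation, the following NumPy sketch computes the standard biased empirical HSIC with Gaussian kernels, trace(KHLH)/(N − 1)², where H is the centering matrix; it is an illustration only, not necessarily the exact estimator or normalization of [24]-[26], and the kernel widths are left as explicit arguments.

```python
import numpy as np

def gaussian_gram(v: np.ndarray, sigma: float) -> np.ndarray:
    """Gram matrix k(v_i, v_j) = exp(-(v_i - v_j)^2 / (2 sigma^2)) for a 1-D sample."""
    d2 = (v[:, None] - v[None, :]) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

def hsic_biased(x: np.ndarray, y: np.ndarray, sigma_x: float, sigma_y: float) -> float:
    """Biased empirical HSIC: trace(K H L H) / (N - 1)^2."""
    n = len(x)
    K = gaussian_gram(x, sigma_x)
    L = gaussian_gram(y, sigma_y)
    H = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    return float(np.trace(K @ H @ L @ H) / (n - 1) ** 2)
```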

3) NONLINEAR MEASURES BASED ON SEGMENTATION
Maximal Information Coefficient (MIC) performs segmentation of the original variable space [27], [28]. It then detects a nonlinear dependence by accumulating the mutual information between two variables over all the segments. Seen through Equ. (6), MIC can be understood as the mean of piecewise log ratios of the left and right sides of Equ. (1). It has hyperparameters determining the maximal grid size and the maximal segment size. The parameters to optimize via training are the kth segmentation pattern for x and the lth segmentation pattern for y. MI_smp(SEG_k, SEG_l) is the sample mutual information when the sets of segments are SEG_k and SEG_l. p_{XY,kl}(x, y), p_{X,k}(x), and p_{Y,l}(y) are the probabilities estimated based on the ratio of sample points in SEG_k and SEG_l. MIC(x, y) is the maximal mutual information obtained with the optimal sets of segments SEG_k* and SEG_l*. MIC ranges from 0 to +1, where a larger value means a stronger dependence. It can express any family of nonlinear functions representing a dependence, but it may cause overfitting if all the data are used for the segmentation pattern search. The advantage that MIC detects any of one-to-one, one-to-many, many-to-one, and many-to-many relationships backfires when identifying which variable causes which one. The versatility of MIC is said to be controversial [37].
The measures CC, DCC, HSIC, and MIC are established basic ones. Consequently, recent related studies are on the analysis, assistance, expansion, and utilization of these measures. There are studies that analyzed the characteristics and behaviors of the measures theoretically and/or empirically [38]- [41]. For the assistance of the measures, other studies proposed algorithms to accelerate measure value calculation [42], [43]. Other studies aim at expanding and utilizing the measures in a general manner. They proposed methods based on the measures for the following: statistical tests [44]- [46], robustness improvement [47]- [49], algorithmic strategies for multi dependence detection [50], [51], feature selection [52]- [56], and feature extraction [57], [58]. The other studies are to utilize the measures in specific domains, the applications in short, which are as given in Section I-A. As suggested by the many related studies, CC, DCC, HSIC, and MIC are still the gold standards and so should be the competitors to our proposed measure.

C. NEURAL NETWORKS WITH GROUP LASSO
There is a superficially subtle yet fundamental distinction between NNs for prediction and NN-based measures for variable dependence detection. NNs with Lasso have been proposed in the former sense but not in the latter sense that we pursue. However, they are discussed below because of their technical aspects common to our study. Simple L1 regularization (Lasso) is not efficient at selectively strengthening important neurons, since Lasso adjusts the weights on edges regardless of which edges are connected to which neurons. GL can solve this problem by grouping the edges connected to the same neuron and adjusting the weights groupwise. Hence, NNs with GL attract attention with respect to sparse modeling and variable selection aiming at better prediction.
For sparse modeling in image recognition, Zhao et al. [59] proposed a Deep NN (DNN) classifier that consists of sub-networks representing modalities. The weights in each sub-network are grouped and regularized by GL. Similar attempts were made in the literature [60]-[62]. In speech recognition, Ochiai et al. [63] proposed a hybrid of a DNN and a Hidden Markov Model, where the weights corresponding to each neuron in the hidden layers are grouped and regularized by GL. These methods successfully made the DNNs sparse and improved their performance. The effectiveness of NNs with GL was thus confirmed, but their purpose and approach (prediction by a NN with GL on all the layers) differ from ours (variable dependence detection utilizing a NN with GL on the input layer).
For variable selection, there are studies that regularize the input layer of a NN with GL. They used ''feature selection'' to refer to selecting input variables, even though the features represented by hidden layers were out of their scope. That is confusing, and so we use ''variable selection'' instead. Li et al. [64] employed GL and L1/2 regularization on the input layers of multilayer feedforward NNs including NNR and the Autoencoder (AE). Han et al. [65] made a similar attempt regarding the AE. Zhang et al. [66] proposed a smooth differentiable GL on the input layer for efficient training. These methods remove input variables independent of the output variable under fixed input and output variables. If a strong dependence exists among the input variables, some of them are randomly selected as representatives by GL. We call this problem ''multi-co-nonlinearity'' (or ''multi-col-nonlinearity''), named after multicolinearity (or multicollinearity) [18], [19]. The reproducibility of such a selection might be open to question. Obviously, these methods are not concerned with detecting nonlinear dependences; they are not co-nonlinearity measures.

III. PROPOSAL OF NNR-GL
The background in Section I and the related work in Section II motivated and led us to propose NNR-GL. In Section III, we concretize NNR-GL by providing the underlying ideas and algorithm design in Section III-A and the formulation and definition in Section III-B. Additionally, we answer potential questions that the readers may have about NNR-GL in Section III-C.

A. UNDERLYING IDEAS AND ALGORITHM DESIGN
Let us begin with a summary of what Section II suggests. Aiming at variable dependence detection, there are linear and nonlinear (mapping-based and segmentation-based) measures, but no NN-based ones. The past successes of NNs with GL in prediction encourage us to devise a novel NN-based measure. The core of the measure should be a NNR with GL regularizing the input layer. Unlike the existing NNs with GL, the measure should exchange input and output variables. Motivated by the above, we propose a NN-based measure called NNR-GL that detects nonlinear dependences among multiple variables with high performance. We design NNR-GL to have Function 1 (iterative nonlinear regression) and Function 2 (measure value calculation) as shown in Fig. 2.

1) FUNCTION 1
NNR-GL Core plays the main role here. In the training and validation phases, NNR-GL Core models nonlinear dependences between multiple input variables and a single output variable. It makes a contrast of the weights, i.e. the regression coefficients (RCs), on the input variables according to their contributions to regression. In the test phase, the trained and validated NNR-GL Core estimates the regression performance, i.e. the residual sum of squares (RSS). This process is applied individually to each variable as the output. If high reproducibility is required, regression is repeated and averaged over different initializations, expecting a kind of ensemble effect. Consequently, the sets of RCs and RSS are obtained for all the different output variables and initializations.
There are two reasons for changing the output variable. One reason is to solve the multi-co-nonlinearity problem, which was discussed at the end of Section II. Executing NNR-GL Core for different output variables reveals dependences possibly masked by multi-co-nonlinearity. Regarding the other reason, suppose x_3 = f(x_1) + g(x_2). The dependence between x_3 and x_1 and that between x_3 and x_2 are detectable when x_3 is regressed on x_1 and x_2. Meanwhile, these dependences are less detectable or not detectable when x_1 or x_2 is regressed on the others, because the inverse mapping of the nonlinear f(x_1) and g(x_2) is one-to-many. To solve this problem, NNR-GL runs NNR-GL Core for different output variables and reveals dependences possibly masked by inverse mapping. In other words, NNR-GL takes into account which variable causes which one. In the example, the dependence between x_3 and x_1 is adopted when x_3 is regressed, but it is discarded when x_1 is regressed.
Function 1 is embodied in the following pseudocode.

Function 1: Iterative nonlinear regression by NNR-GL Core
 1: % Train and validate NNR-GL Core for each hyperparameter setting, output variable, and initialization.
 2: for each hyperparameter setting hs do
 3:   for j = 1 to N_z do
 4:     for k = 1 to N_rnd do
 5:       Under hs, train NNR-GL Core initialized with k to regress the jth variable z_j on z_1, ..., z_{j-1}, z_{j+1}, ..., z_{N_z} using D_tr.
 6:       Validate the trained NNR-GL Core using D_va.
 7:       Record the best hyperparameter setting bhs and the best parameter setting bps for j and k.
 8:     end for
 9:   end for
10: end for
11: % Test NNR-GL Core for each output variable and initialization.
12: for j = 1 to N_z do
13:   for k = 1 to N_rnd do
14:     Under the bhs and bps corresponding to j and k, apply the trained and validated NNR-GL Core to regress the jth variable z_j on z_1, ..., z_{j-1}, z_{j+1}, ..., z_{N_z} using D_te.
15:     Extract RCs and RSS from the trained, validated, and tested NNR-GL Core.
16:   end for
17: end for
18: Output the sets of RCs and RSS for all j and k.

Function 1 receives the training, validation, and test sets D_tr, D_va, and D_te and returns the sets of RCs and RSS. For the hyperparameter setting hs, the jth output variable, and the kth initialization, NNR-GL Core goes through training using D_tr and validation using D_va (Lines 1 to 10). The best hyperparameter setting bhs and the best parameter setting bps are obtained for each output variable with each initialization. NNR-GL Core with the settings bhs and bps then moves on to testing using D_te. It models dependences between the output variable z_j and the input variables z_1 to z_{N_z} except z_j (Lines 11 to 17). In the end, Function 1 outputs the sets of RCs and RSS for all the output variables and initializations.

2) FUNCTION 2
Given the sets of RCs and RSS from Function 1, NNR-GL unifies the RC and RSS for each pair of an input variable and an output variable into a quantity representing their co-nonlinearity. This pairing clarifies each input-output co-nonlinearity, even if there exist multi-co-nonlinearities among the input variables. The quantity for variables x and x' is obtained both when x is regressed and when x' is regressed. Out of such quantities, NNR-GL picks up the larger one, which represents the dependence in forward mapping rather than in inverse mapping. For reproducibility, quantities under the same condition are averaged over the different initializations. NNR-GL then outputs the set of quantities as the measure values of all the dependences.
Function 2 is embodied in the following pseudocode.

Function 2: Measure value calculation
 1: % Calculate the quantity for each pair of variables under each initialization.
 2: for j = 1 to N_z do
 3:   for m = 1 to N_z do
 4:     for k = 1 to N_rnd do
 5:       if j == m then Set the quantity qntty_jmk, which represents the dependence between the jth and mth variables under the kth initialization, to 1.
 6:       else
 7:         Calculate qntty_jmk by unifying the RC and RSS corresponding to j, m, and k.
 8:         Overwrite the smaller with the larger out of qntty_jmk and qntty_mjk.
 9:       end if
10:     end for
11:     Calculate the mean of qntty_jmk over the N_rnd initializations and set qntty_jm to this mean.
12:   end for
13: end for
14: Output the measure value matrix MV_te consisting of qntty_jm for j, m = 1, 2, ..., N_z.

As in the pseudocode, Function 2 receives the sets of RCs and RSS and returns the measure value matrix. The RC and RSS, which were obtained when the jth variable was the output and the mth variable was one of the inputs under the kth initialization, are unified into the quantity qntty_jmk (Lines 1 to 13). This quantity represents the co-nonlinearity generalized for D_te between the jth and mth variables. The calculation includes selecting the larger of qntty_jmk and qntty_mjk and averaging it over the initializations. Finally, Function 2 outputs the measure value matrix MV_te whose elements are qntty_jm for all the combinations of variables.
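To make the flow concrete, here is a NumPy sketch of the matrix assembly, assuming the per-pair quantities of Equ. (10) (defined in Section III-B) have already been computed for every output variable, input variable, and initialization; the array layout and function name are ours, not part of the original implementation.

```python
import numpy as np

def measure_value_matrix(qntty: np.ndarray) -> np.ndarray:
    """Assemble MV_te from per-initialization quantities (the core of Function 2).

    qntty[j, m, k]: quantity of Equ. (10) between the output variable j and the
                    input variable m under the kth initialization.
    """
    n_z = qntty.shape[0]
    q = qntty.copy()
    for j in range(n_z):
        q[j, j, :] = 1.0                         # a variable trivially depends on itself
    q = np.maximum(q, q.transpose(1, 0, 2))      # keep the larger, forward-mapping direction
    return q.mean(axis=2)                        # average over the N_rnd initializations
```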

B. FORMULATION AND DEFINITION
We formulate the model structure, objective function, and optimization of NNR-GL Core in Function 1. NNR-GL Core accepts any type of NN for regression, but a Multilayer Perceptron (MLP) is used here for simplicity. NNR-GL Core has N_L + 1 layers, where the 0th is the input layer, the 1st to (N_L − 1)th are the hidden layers, and the N_Lth is the output layer. In the input layer, there are N_z neurons corresponding to the input variables z_1, z_2, ..., z_{N_z}. In Fig. 2, the regression target variable z_j is excluded from the input layer for ease of understanding. However, z_j is included here for a general formulation, and its weight is fixed to 0 to exclude it parametrically. In the lth layer, there are N^(l)_N neurons. w^(l)_mn is the weight on the edge from the mth neuron in the (l − 1)th layer to the nth neuron in the lth layer. W^(l), given in Equ. (7), is the weight matrix containing w^(l)_mn:

W^(l) = (w^(l)_mn), 1 ≤ m ≤ N^(l−1)_N, 1 ≤ n ≤ N^(l)_N.   (7)

W, which gathers W^(l) over 1 ≤ l ≤ N_L, is the weight matrix to be set by training. As mentioned above, w^(1)_jn is fixed to 0. The numbers of layers and neurons are hyperparameters to be set by validation.
The objective function of NNR-GL Core is defined in Equ. (8), which is composed of a term for regression performance and a term for variable selection. λ is a hyperparameter balancing the two terms. s and N_tr are the index and number of training sample points, respectively. z_js is the sth observation of the output variable z_j, and ẑ_js(W) is the prediction of z_js by NNR-GL Core with the weight matrix W. The first term is the RSS between z_js and ẑ_js(W). The second term is the GL regularization to pick up contributable input variables out of z_1 to z_{N_z} except z_j. The weights w^(1)_m on the edges connecting the mth input neuron to the first hidden layer are grouped into a vector, and the L2 norm of each group is penalized:

J_j(W) = Σ_{s=1}^{N_tr} (z_js − ẑ_js(W))² + λ Σ_{m=1, m≠j}^{N_z} ||w^(1)_m||_2.   (8)

NNR-GL Core is trained by minimizing the objective function. Any type of optimization method is acceptable, but here we use the most standard ones, Backpropagation and Stochastic Gradient Descent [30], [31]. Equ. (9) shows the parameter update rule with the learning rate η:

w^(l)_mn ← w^(l)_mn − η ∂J_j(W)/∂w^(l)_mn.   (9)

The partial derivative of the first term of the objective function is derived by the chain rule, where z denotes the outputs of all the neurons. The partial derivative of the second term is directly and easily derived not only for the MLP but also for other NNs. After training and validation, the weight vector for each neuron of the input layer, i.e. the RC, is obtained. The RSS is also obtained via testing.
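As a rough, simplified sketch of how Equ. (8) can be minimized, the following PyTorch code (our own illustration; the paper's implementation is in MATLAB) trains an MLP with sigmoid hidden units while penalizing the L2 norm of each input-layer weight group. The function names, default values, and the plain subgradient treatment of the non-smooth GL term are our assumptions.

```python
import torch
import torch.nn as nn

def build_mlp(n_in: int, n_hidden: int, n_hidden_layers: int) -> nn.Sequential:
    """MLP with sigmoid hidden units and a single linear output neuron."""
    layers, width = [], n_in
    for _ in range(n_hidden_layers):
        layers += [nn.Linear(width, n_hidden), nn.Sigmoid()]
        width = n_hidden
    layers.append(nn.Linear(width, 1))
    return nn.Sequential(*layers)

def train_core(z: torch.Tensor, j: int, lam: float = 0.1, n_hidden: int = 60,
               n_hidden_layers: int = 2, lr: float = 0.01, epochs: int = 1000) -> nn.Sequential:
    """Regress variable j on the others with GL on the input layer, as in Equ. (8)."""
    n_z = z.shape[1]
    model = build_mlp(n_z, n_hidden, n_hidden_layers)
    first = model[0]                              # input-layer weights, shape (n_hidden, n_z)
    with torch.no_grad():
        first.weight[:, j] = 0.0                  # exclude z_j from its own regression
    mask = torch.ones(n_z, dtype=torch.bool)
    mask[j] = False                               # GL groups: every input variable except z_j
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    target = z[:, j:j + 1]
    for _ in range(epochs):
        opt.zero_grad()
        rss = ((target - model(z)) ** 2).sum()               # first term of Equ. (8)
        gl = lam * first.weight[:, mask].norm(dim=0).sum()   # one L2-norm group per input variable
        (rss + gl).backward()
        first.weight.grad[:, j] = 0.0             # keep z_j's weights fixed at 0
        opt.step()
    return model
```

The excluded variable z_j is pinned to weight 0 by zeroing both its input-layer column and the corresponding gradient, which mirrors the parametric exclusion described above.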
In Equ. (10), we define the quantity representing a co-nonlinearity calculated in Function 2. It is a multiplication of the normalized RC and RSS. W²_x is the normalized RC, that is, the normalized squared L2 norm of the weight vector of the focal input variable x. w^(1)_m is the weight vector of the mth input variable, and M is the number of input variables, N_z − 1. W²_x is normalized either to sum to 1 or to make the maximum 1. R²_y is the normalized RSS, that is, the coefficient of determination for the output variable y. y_i is the ith observation, ŷ_i is the ith prediction, ȳ is the mean of y_i, and N_te is the size of the test dataset. R²_y basically ranges from 0 to 1, but it can happen to be less than 0 when regression fails. Such a negative value means no dependence, and so NNR-GL replaces it with 0. Finally, W²_x and R²_y are square-rooted and multiplied into the quantity, i.e. the measure value:

qntty(x, y) = sqrt(W²_x) · sqrt(R²_y), with R²_y = 1 − Σ_{i=1}^{N_te} (y_i − ŷ_i)² / Σ_{i=1}^{N_te} (y_i − ȳ)².   (10)

Averaging it over initializations can raise the reproducibility of detection.
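A NumPy sketch of Equ. (10) for one input-output pair is given below, using the sum-to-one normalization of W²_x (the max-to-one variant would divide by the largest group norm instead); the helper name and array layout are ours.

```python
import numpy as np

def cononlinearity_quantity(w1: np.ndarray, x_index: int,
                            y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """sqrt(W^2_x) * sqrt(R^2_y) of Equ. (10) for one input-output pair.

    w1:      trained input-layer weight matrix with one column per input variable.
    x_index: column index of the focal input variable x.
    """
    group_sq = (w1 ** 2).sum(axis=0)              # squared L2 norm of each input variable's group
    w2_x = group_sq[x_index] / group_sq.sum()     # normalized to sum to 1
    rss = ((y_true - y_pred) ** 2).sum()
    tss = ((y_true - y_true.mean()) ** 2).sum()
    r2_y = max(0.0, 1.0 - rss / tss)              # a negative R^2 is treated as no dependence
    return float(np.sqrt(w2_x) * np.sqrt(r2_y))
```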

C. ANSWERS TO POTENTIAL QUESTIONS
We here answer potential questions about our measure NNR-GL. Questions and answers (QAs) concerning meaningfulness and significance are QAs 1 to 3, and those concerning technical aspects are QAs 4 to 7, as follows.

QA1: ''NNs do not provide the explicit mathematical functions of dependences. Is NNR-GL based on NNs really meaningful?'' This question concerns the second and third steps that brush up and formulate promising dependences (see Section I-A). These steps should be taken after the detection of potential dependences, which is done by measures including NNR-GL in the first step for human awareness.
QA2: ''There exist NNs with GL. Is NNR-GL really novel?'' NNs with GL actually have been studied [59]-[66]. They are NNs themselves aiming at sparse modeling or variable selection, not measures aiming at variable dependence detection (see Section II-C). They make non-contributable neurons perish for better prediction. In contrast, NNR-GL is a measure containing a NN with GL in Function 1 to detect co-nonlinearities, followed by Function 2 to derive quantities representing the detected co-nonlinearities.
QA3: ''What is the difference between NNR-GL and the conventional nonlinear measures?'' NNR-GL is free from the problems that the conventional mapping-based and segmentation-based approaches suffer from, i.e. the limitation to two variables and the limited families of nonlinear functions. Modeling by the conventional measures is done pairwise; in contrast, modeling by NNR-GL is done output-variable-wise. NNR-GL can simultaneously model nonlinear dependences among multiple variables with high performance. The concise mechanism of GL regularizing only the input layer enables NNR-GL to accept various kinds of NNRs.
QA4: ''Why does NNR-GL not take the weights over all the neurons and layers into account?'' The reason why not all the weights are regularized by GL is as follows. If they were, neurons highly graded by GL in the input layer might happen to be degraded by GL in the subsequent layers, and vice versa. GL on only the input layer localizes variable selection to avoid such cancellation. Moreover, it is easy to embed and thus enables NNR-GL to accept various NNRs. The reason why not all the weights are included in the measure value is the following. Neurons in the subsequent layers are indirectly but commonly connected to the neurons in the input layer. Their weights are common to the input variables and unnecessary in measure value calculation.

QA5: ''Is it reasonable to use RCs and RSS for variable dependence detection?'' When least squares is used in single-input linear regression, the RC becomes the covariance between the input and output variables x and y divided by the variance of x [67], [68]. The RC is thus equivalent to the CC between x and y when y has a variance of 1. Remember that CC is a measure formulating the difference between the left and right sides of Equ. (1). Therefore, the RC is a kind of measure. The RSS normalized by the total sum of squares is called the coefficient of determination. It equals the square of the CC between x and y [67], [68], and so the RSS can be a measure as well as the RC. The above applies to multi-input nonlinear regression and supports the use of RCs and RSS.
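For reference, the least-squares fact cited in QA5 follows from the standard normal equation, written here with the sample statistics of Section II-B: β̂ = Σ_{i=1}^{N} (x_i − x̄)(y_i − ȳ) / Σ_{i=1}^{N} (x_i − x̄)² = COV_smp(x, y) / COV_smp(x, x), i.e. the RC of single-input linear regression is the sample covariance of x and y divided by the sample variance of x.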
QA6: ''NNR-GL requires training, validation, and test sets split from a dataset. Is NNR-GL really better than measures with no split?'' This issue is not specific to NNR-GL. The ultimate goal of a measure is to detect dependences in a population (namely, generalized dependences) using a finite sample. For that, going through training, validation, and test is essential. Strict assumptions such as linearity make a measure free from training (optimizing the shape of a dependence function) and validation (optimizing the family of dependence functions). In return, the measure sacrifices the chance to detect generalized dependences that are unexpected and nonlinear. To overcome this problem, NNR-GL goes through training, validation, and test without too many assumptions.

QA7: ''How large is the computational complexity of NNR-GL?'' Most of the computational complexity depends on what NNR is used in NNR-GL. One needs to select a NNR suitable for the speed required in his/her specific task, especially considering whether real-time processing is required or not. The number of combinations affects the computation time, too. For the conventional measures, it is N_z² due to their pairwise processing. For NNR-GL, it is N_z × N_rnd, where N_z and N_rnd are the numbers of variables and initializations, respectively. The experimental results given later in Section V indicate that a value of around 3 is sufficient for N_rnd. Therefore, the number of combinations for NNR-GL becomes considerably smaller than that for the conventional measures when dealing with many variables.

IV. EXPERIMENTAL DESIGN
To fairly evaluate our proposed measure NNR-GL through comparison with the conventional measures, the experiment should be well designed. Section IV discusses the design of experiment in various aspects below: the direction and outline in Section IV-A, the competitors and settings in Section IV-B, the datasets in Section IV-C, and the evaluation criteria and methods in Section IV-D.

A. DIRECTION AND OUTLINE
Generally, the evaluation of variable dependence measures should be done stepwise [27]. At first, it is examined whether a measure behaves as expected and detects the known correct dependences (that is, its fundamental effectiveness is examined). Artificially synthesized datasets meet this purpose. Such datasets enable us to mathematically design variable dependences and to check the success or failure of detection using the known dependences. Next, it is examined whether a measure detects known correct dependences in real data (practical effectiveness). Using benchmark datasets on real domains with known dependences is appropriate for that. Finally, it is examined whether a measure detects unknown useful dependences in real world problems (practical usefulness). The present experiment examines the fundamental effectiveness of NNR-GL.

B. COMPETITORS AND SETTINGS
The hyperparameters, parameters, and settings of NNR-GL are listed in Table 3. To identify the baseline of performance, we decided to use the most traditional NN structure for NNR-GL, which was a MLP with sigmoid activation functions. NNR-GL is compared to the four representative measures reviewed in Section II-B, namely CC, DCC, HSIC, and MIC.
CC does not involve any hyperparameters and parameters because of its linearity [18], [21]. DCC is based on the norm of the distance of characteristic functions. It has no hyperparameters and parameters to preset or optimize [22]. HSIC is based on the covariance of data mapped by kernelization. The width of a Gaussian kernel function σ and the test statistic a are hyperparameters, which we set to the recommended values [25]. HSIC has no parameters. MIC is based on the mutual information over the segments dividing a variable space. Its hyperparameters are the coefficients α and c to determine the maximal grid size and segment size. We use their recommended values [27]. MIC has parameters to determine the segmentation pattern, and how to set them is discussed in Section IV-C.
The hyperparameters of NNR-GL are the number of layers (N_L + 1), the number of neurons in a hidden layer N_N, and the weight on the GL regularization term λ. (N_L + 1) and N_N are optimized by Grid Search, the details of which are given in Table 3. λ is fixed to 0.1, which performed best in preliminary experiments. With regard to training, the learning rate and the maximum epoch are fixed to 0.01 and 15000, respectively, based on preliminary experiments. The weights are initialized using a Gaussian distribution with mean 0 and standard deviation 1. Unlike the conventional measures, NNR-GL is applicable to both single and multiple input variables. We try both and call NNR-GL with a single input variable NNR-GL(S) and that with multiple ones NNR-GL(M).
For a fair comparison, NNR-GL and the competitive measures were implemented on the common platform MATLAB. For CC, we used the library equipped in MATLAB. For DCC, we downloaded its MATLAB program [69], which was not developed by the proposers of DCC; we carefully reviewed the code, added a function for normalization, and then used this program. For HSIC, we downloaded and used the MATLAB program developed by the proposers of HSIC [70]. The same applies to MIC [71]. We developed the program of NNR-GL ourselves using the standard commands of MATLAB.

C. DATASETS
To analytically examine the fundamental effectiveness of NNR-GL, we synthesize artificial datasets whose variable dependences are known, basically following Reshef et al. [27]. Data size is a dominant factor affecting detection performance, and so the robustness of each measure to data size is investigated. The robustness to another dominant factor, observation noise, is investigated as well. In the literature [27], simulated observation noise was added only to the output variable of a dependence function. It is more realistic to add noise to both input and output variables, and we do so.
Each dataset consists of the dependent variables x_1 and x_2 and the independent one x_3. There are five dependence functions mapping x_1 to x_2, as in Table 4. Line is linear, and the others are nonlinear, with Exp for exponential, Parab for parabolic, Cubic for cubic, and Sine for sinusoidal. The true dependences are visualized in the left of Fig. 3. The values of x_1 are generated at even intervals. By substituting them into a dependence function, the values of x_2 are obtained. The values of x_3 are generated in the same way as x_1. Assuming the generated points (x_1, x_2, x_3) to be a population, N points are extracted from the population according to a uniform distribution. N is set to 3000, 1500, 300, or 150 [points]. To simulate real sampling, observation noise following a Gaussian distribution with mean 0 and standard deviation SD is added to the x_1, x_2, and x_3 of the extracted points. SD is varied from 0 to 40 in steps of 5 [%] of the variable range. For example, the training set of 100 points, split out of 300 sampled points with 1 [%] noise, is plotted for each dependence in the middle of Fig. 3.
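A minimal NumPy sketch of this synthesis procedure is given below; the concrete function forms, the [0, 1] domain, the placeholder population size, and the separate index draw that keeps x_3 independent are our illustrative assumptions, not the exact generator used in the experiments.

```python
import numpy as np

DEPENDENCES = {                                   # x2 = f(x1); the five families of Table 4
    "Line":  lambda x: x,
    "Exp":   lambda x: np.exp(x),
    "Parab": lambda x: x ** 2,
    "Cubic": lambda x: x ** 3,
    "Sine":  lambda x: np.sin(2 * np.pi * x),
}

def synthesize(dep: str, n: int, noise_rate: float,
               pop_size: int = 100_000, seed: int = 0) -> np.ndarray:
    """Sample n noisy points (x1, x2, x3) for one dependence function."""
    rng = np.random.default_rng(seed)
    x1 = np.linspace(0.0, 1.0, pop_size)          # evenly spaced values of x1
    x2 = DEPENDENCES[dep](x1)                     # dependent variable
    x3 = np.linspace(0.0, 1.0, pop_size)          # independent variable, generated like x1
    idx = rng.integers(0, pop_size, size=n)       # uniform sampling from the population
    idx3 = rng.integers(0, pop_size, size=n)      # drawn separately so that x3 stays independent
    sample = np.column_stack([x1[idx], x2[idx], x3[idx3]])
    sd = noise_rate * (sample.max(axis=0) - sample.min(axis=0))  # SD as a fraction of each range
    return sample + rng.normal(0.0, sd, size=sample.shape)       # noise on x1, x2, and x3 alike
```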
A dataset is evenly split into three sets for training (parameter setting), validation (hyperparameter setting), and test (generalized detection performance estimation). Only a test set is used for CC and DCC, since they have no parameters and hyperparameters. A test set is used for HSIC as well. HSIC has no parameters but has hyperparameters fixed to the recommended values [25]. In our opinion regarding MIC, a training set should be used to set the segmentation pattern which is a kind of parameters. However, that was not mentioned in the literature [27]. We presume that they used a dataset for both training and test, and we follow this. The hyperparameters of MIC are fixed to the recommended values [27]. For NNR-GL, parameter setting, hyperparameter setting, and generalized detection performance estimation are done using training, validation, and test sets, respectively.

D. EVALUATION CRITERIA AND METHODS
The following evaluation criteria seem reasonable, because the correct dependences of artificial datasets are known: True Positive (TP) representing that a measure correctly identifies an existent dependence and True Negative (TN) representing that a measure correctly identifies a non-existent dependence. However, TP and TN require the binarization of measure values. For detailed analysis, it is better to estimate how close measure values are to TP and TN without binarization. We hence define two criteria called True Positive Certainty (TPC) and True Negative Certainty (TNC). TPC estimates whether a measure value is sufficiently large when a dependence exists. It equals the measure value itself. TNC estimates whether a measure value is sufficiently small when a dependence does not exist. It is (1 − the measure value).
To discuss the robustness to data size and noise rate, the curves of TPC and TNC of a measure are drawn over noise rates for each data size as in Fig. 4. In general, a signal is buried in strong noise; an existent dependence becomes difficult to detect when the noise rate is high. A TPC curve appears like an inverted sigmoid function, of which higher position represents the better TP detection. It is rare to misrecognize noise as a signal, but it becomes slightly more frequent for stronger noise. A TNC curve appears like an inverted and compressed sigmoid function, of which higher position represents the better TN detection. We qualitatively estimate performances based on the TPC and TNC curves.
Moreover, we calculate the Area Under Curve (AUC) of a curve over noise rates for each data size as a quantitative estimation. We call the AUC of a TPC curve TPC-AUC and that of a TNC curve TNC-AUC. The upper limit of integration in the AUC calculation, i.e. the vertical line in Fig. 4, is determined fairly for the measures by the following procedure. The noise rates on the horizontal axis at which a TPC curve falls to the measure values 0.6, 0.5, and 0.4 on the vertical axis are picked up. These noise rates are accumulated and averaged over all the measures. The upper limit for both TPC-AUC and TNC-AUC is set to this average, commonly for all the measures. The mean of TPC-AUC and TNC-AUC is also used as a comprehensive detection performance criterion called CDP-AUC. Detection performances are quantitatively discussed based on TPC-AUC, TNC-AUC, and CDP-AUC.
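The criteria can be computed along the following lines (a NumPy sketch under our own reading of the procedure; TPC curves hold the measure values of the dependent pair, TNC curves hold 1 minus the measure values of the independent pairs, and the normalization of each AUC by the integration width is our assumption so that CDP-AUC stays between 0 and 1).

```python
import numpy as np

def curve_auc(noise_rates: np.ndarray, curve: np.ndarray, upper_limit: float) -> float:
    """Area under a TPC or TNC curve up to the common upper limit of noise rate,
    normalized by the integration width so that a perfect curve scores 1 (our assumption)."""
    mask = noise_rates <= upper_limit
    x, y = noise_rates[mask], curve[mask]
    return float(np.trapz(y, x) / (x[-1] - x[0]))

def cdp_auc(noise_rates, tpc_curves, tnc_curves, upper_limit) -> float:
    """CDP-AUC: mean of the TPC-AUCs and TNC-AUCs over the variable combinations."""
    curves = list(tpc_curves) + list(tnc_curves)       # e.g. one TPC curve and two TNC curves
    return float(np.mean([curve_auc(noise_rates, c, upper_limit) for c in curves]))
```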
The past studies [27], [37] qualitatively discussed TP detection performance by visualization. They did not consider TN detection performance, and such a one-sided view might lead to overestimate measures that output a large measure value even if there is no dependence. They also did not have a quantitative discussion. We design the evaluation criteria to overcome these past issues.

V. EVALUATION EXPERIMENT ON FUNDAMENTAL EFFECTIVENESS USING ARTIFICIAL DATASETS
We carried out the experiment designed in Section IV and report its details here in Section V. Sections V-A and V-B are devoted to the purpose and conditions and the results and discussion, respectively.

A. PURPOSE AND CONDITIONS
We conducted a set of experiments to examine whether NNR-GL detects co-nonlinearities correctly, compared to the conventional measures. Artificial datasets with known true variable dependences were used to estimate the correctness of detection. The experimental design and conditions were as given in Section IV. In short, the dependences Line, Exp, Parab, Cubic, and Sine were assumed between x 1 and x 2 accompanied with independent x 3 . Datasets were sampled with observation noise from each dependence, where the data size was 3000 to 150, and the noise rate was 0 to 40 [%]. Each dataset was divided into training, validation, and test sets.
The competitors CC, DCC, HSIC, and MIC were applied to each pair of x 1 , x 2 , and x 3 . There were two conditions for NNR-GL: NNR-GL(S) applied to a single-input variable similarly to the competitors and NNR-GL(M) applied to multi-input variables. The hyperparameters and parameters of measures were optimized or set to the recommended values. Three different random initializations were tried for NNR-GL. Detection performances were estimated based on TPC and TNC curves and their AUCs.

1) BEHAVIORS OF NNR AND GL
Prior to discussing the performance of NNR-GL, we confirm the behaviors of NNR and GL inside NNR-GL. Part of the results learned by NNR is visualized in the right of Fig. 3. This was obtained under the multi-input NNR-GL(M), using a training set of 100 sample points out of 300 with 1 [%] noise. The learned dependences in the right look similar to the true ones in the left; NNR succeeded in modeling the true dependences behind the sampled data in the middle. To save space, we briefly report that NNR worked well for the other data sizes and noise rates as well.
Let us briefly mention hyperparameter setting. As is common for NNs, the hyperparameter values optimized by validation in the experiment differed depending on variable dependence, data size, and noise rate. As a guide, we provide the average number of total layers including the input, hidden, and output ones (# of layers) and that of neurons in each hidden layer (# of neurons). These numbers are the hyperparameters with the most significant effect on regression performance. In the case of large and noiseless data with the data size 3000 and the noise rate 0 [%], # of layers ranged from 4 to 16, and # of neurons from 60 to 86, over all the 5 dependences. In the case of small and noisy data with 150 and 10 [%], # of layers ranged from 4 to 13, and # of neurons from 20 to 73. It is not possible to say definitely, but these numbers would be reasonable for our datasets, which were not images or sounds but numerical.
Regarding GL, we focus on NNR-GL(M) using a training set of 1000 sample points out of 3000 with 0 [%] noise, because this condition clearly demonstrates the behaviors of GL. Part of the results optimized by GL is summarized in Table 5. For example, 0.2092 at the top left is the L2 norm of the weights on the edges connected to the input variable x_2. This numeric value, namely the RC of x_2, was obtained in the regression of x_1 on x_2 and x_3 with the random initialization R1. Although differences appear to some extent depending on initialization, the trend of the results is basically consistent. Therefore, we discuss the overall trend regardless of initialization.
Logically speaking, the RC of x_1 should be the largest, representing its contribution to regression, when the output variable is x_2, because x_2 = f(x_1). Moreover, the RC of x_2 should be the largest only for Line, due to linearity. As expected, the RCs of x_1 when x_2 is regressed on x_1 and x_3 are the largest for all the dependences in Table 5. In the case of Line, the RCs of x_2 when x_1 is regressed on x_2 and x_3 are the largest, too. Under the other conditions, this trend gets somewhat blurred with the decrease of sample size and the increase of noise rate but still appears. GL worked well to differentiate dependent and independent variables.

2) EFFECTIVENESS OF NNR-GL
We move on to discussing the detection performance of NNR-GL. Figs. 5 to 8, which correspond to the data sizes from 3000 to 150, show the TPC curves, TNC curves, and CDP-AUCs of the proposed and conventional measures. The results are aligned from top to bottom according to the dependences Line, Exp, Parab, Cubic, and Sine. They are aligned from left to right according to the variable combinations x_1 and x_2, x_1 and x_3, and x_2 and x_3. The dependent x_1 and x_2 have TPC curves representing TP detection and no TNC curves. The independent x_1 and x_3 have TNC curves representing TN detection and no TPC curves, and the independent x_2 and x_3 have the same. Refer to Section IV-D on how to read the TPC and TNC curves. The values of CDP-AUC, which is the mean of TPC-AUC and TNC-AUC over the variable combinations, are listed in the rightmost column. NNR-GL behaved quite similarly under the three different initializations, and so we provide only the mean measure values of NNR-GL over the initializations.
In principle, NNR-GL(M), which is the original use of NNR-GL, can detect multiple dependences simultaneously and efficiently. However, there is a possibility that NNR-GL(M) is disadvantaged in learning compared to NNR-GL(S). The reason is that NNR-GL(S) devotes all its learning resources to a single-to-single dependence, while NNR-GL(M) assigns them to multi-to-single dependences. Despite this possibility, the curves of NNR-GL(S) and NNR-GL(M) are almost the same in Figs. 5 to 8. Concretely, the downward triangles for NNR-GL(S) and the upward triangles for NNR-GL(M) overlap and look like stars. The weights on the multiple input variables were successfully optimized to make NNR-GL(M) equivalent to NNR-GL(S), as we expected. Hereinafter, we simply regard NNR-GL(S) and NNR-GL(M) as the same NNR-GL.
In the left graph of Fig. 5 where the data size is 3000 (1000 for training, 1000 for validation, and 1000 for test), the TPC curve of NNR-GL is located in the highest position for all of Line, Exp, Parab, Cubic, and Sine. In the middle and right graphs, the TNC curves of NNR-GL lie highest as well. In the rightmost table, the values of CDP-AUC of NNR-GL(S) and NNR-GL(M) are the highest. NNR-GL outperformed the other measures consistently in terms of all of TP detection, TN detection, and the kind of dependences. NNR well modeled a variety of co-nonlinearities, and GL on the NNR's input layer well differentiated dependent and independent variables. That brought the high TP and TN detection performances of NNR-GL.
Paying attention to the other measures in Fig. 5, the TPC curve of CC takes the highest position for Line, as does that of NNR-GL. It appears unstably in lower positions for the other dependences. The TNC curves of CC stay in the third highest position for all the dependences. Reflecting the TPC and TNC trends, the values of CDP-AUC differ depending on the kind of dependence. These results are a matter of course, since the model of CC is fixed to be linear. The TPC curve of DCC takes the highest position for Line, but it takes the third position for the other dependences. Its TNC curves stay in the second lowest position for all the dependences. With respect to CDP-AUC, DCC is slightly better and more stable than CC. The model of DCC is nonlinear but covers a limited function family. This limitation would lead to these results. For almost all the dependences, the TPC curve of HSIC takes the lowest position, while its TNC curves stay in the highest position or close to it. Reflecting those, the values of CDP-AUC are comparatively low in spite of the broader family of nonlinear functions that HSIC has. There is a possibility that the hyperparameters fixed to the recommended values hindered HSIC from its best performance. The TPC curve of MIC takes the fourth position for Line and the second position for the other dependences. Its TNC curves stay in the lowest position for all the dependences. Due to those, the values of CDP-AUC are not so high. The flexible nonlinear model of MIC would end up somewhat overfitting to TPs by using the same data for segmentation and detection.
Comparing Figs. 5 to 8, the TPC and TNC curves of all the measures gradually get down and unstable as the data size decreases from 3000 to 150. However, their trends are consistent throughout the figures; the TPC and TNC curves of NNR-GL are located in the highest for all the dependences and data sizes. Under some conditions with the data size of 300 or 150, the TPC curves of NNR-GL quickly fall outside of the upper limit of noise rate. That suggests that NNR-GL refrained from the overdetection of dependences under too high noise and led to better TN detection. In contrast, the measures of which TPC curves stay higher outside of the upper limit tend to have lower performances in TN detection. The success of NNR-GL in Figs. 7 and 8, of which data sizes are 300 and 150, is noteworthy. The result suggests that overfitting for small data was avoided by GL and the use of training, validation, and test sets, which are equipped in NNR-GL but not in the other measures.
Here, the findings above are summarized and concluded. In both TP and TN detections, NNR-GL outperformed the conventional measures. Its performance was stable to initialization, robust to noise rate, and robust to data size, commonly for all the dependences. Therefore, the fundamental effectiveness of NNR-GL was confirmed.

VI. PILOT EXPERIMENT USING REAL BENCHMARK DATASETS
In Section V, the experiment and analysis using artificial data suggested that NNR-GL works robustly and better than the conventional measures. It is time to broaden the horizons to the practicality of NNR-GL. Full-scale experiments will be in the next stage of our research, but we started case studies using real benchmark datasets. Section VI reports those. One case study is given in Section VI-A, and the other is given in Section VI-B. In Section VI-C, we discuss the perspectives found in all the experiments in the present research.

A. CASE STUDY USING PIMA DIABETES DATASET
To explore the potential of NNR-GL for real world problems, we carried out a pilot experiment including two case studies using real benchmark datasets. This paper aims to propose and analytically evaluate NNR-GL, and thus we report the pilot experiment only briefly. NNR-GL(M) was employed here, since it worked as well as NNR-GL(S) in the evaluation experiment in Section V. The conditions of NNR-GL and the competitive measures were also the same as in the evaluation experiment. The best performances of NNR-GL and the competitive measures were estimated in the following manner, which was common to the two case studies. For each measure, the threshold of measure values was set so as to find as many of the correct sets of dependent variables as possible. Variables exceeding the threshold were detected and assigned to the corresponding detected set of dependent variables.
The first dataset was the Pima Diabetes Dataset from the website [72], of which 8 variables except the class and 768 sample points were used. We prepared the correct sets of dependent variables based on common sense and a medical literature survey using [73]. There were 3 correct sets of dependent variables: {x_1, x_8}, {x_2, x_3, x_4, x_5, x_6}, and {x_7}. The first set means that the number of pregnancies x_1 and the age x_8 are dependent, which was judged as correct based on common sense. The second set contains the glucose concentration x_2, blood pressure x_3, triceps skin thickness x_4, insulin level x_5, and body mass index x_6; their dependence was supported by a collection of medical literature [73]. The third set consists of the diabetes pedigree x_7, a kind of genetic factor possibly less dependent on the other variables. We applied NNR-GL and the competitive measures and obtained the detected sets of dependent variables. As shown in Table 6, NNR-GL detected all the correct sets, but the other measures did not.

B. CASE STUDY USING US-130 HOSPITAL DATASET
The second dataset was the US-130 Hospital Dataset from the website [74], from which 10 variables were picked up with reference to the literature [75]. This dataset was split into 10 subsets consisting of 1000 sample points each, for 10 trials. We converted the following 6 symbol variables into sets of indicator variables representing the absence of a symbol value with 0 and its presence with 1: race, admission source, specialty of the admitting physician, primary diagnosis, hemoglobin A1c, and readmission rate. These sets were obviously the correct sets of dependent variables and were therefore targeted.
We estimated the performances of NNR-GL and the competitive measures using Equ. (11). The number of trials T is 10. S^(d)_t denotes the set of variable sets detected by a measure in the tth trial, and S^(c) denotes the set of the 6 correct variable sets:

(1/T) Σ_{t=1}^{T} |S^(d)_t ∩ S^(c)| / |S_t|, with S_t = S^(d)_t for P_ave and S_t = S^(c) for R_ave.   (11)

By setting S_t = S^(d)_t, the term in the summation becomes the precision obtained in the tth trial. Hence, Equ. (11) is the averaged precision P_ave over all trials. Similarly, Equ. (11) is the averaged recall R_ave when S_t = S^(c). In Table 7, NNR-GL achieved the highest P_ave and R_ave compared to the other measures.
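Spelled out in code, Equ. (11) amounts to the following (a Python sketch over sets of frozensets; the container choice and the handling of an empty detection are ours).

```python
def averaged_precision_recall(detected_per_trial, correct_sets):
    """P_ave and R_ave of Equ. (11) over T trials.

    detected_per_trial: list (length T) of sets of frozensets detected by a measure.
    correct_sets:       set of the correct frozensets of dependent variables.
    """
    t_count = len(detected_per_trial)
    p_ave = sum(len(d & correct_sets) / len(d) if d else 0.0
                for d in detected_per_trial) / t_count
    r_ave = sum(len(d & correct_sets) / len(correct_sets)
                for d in detected_per_trial) / t_count
    return p_ave, r_ave
```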

C. PERSPECTIVES FOUND IN THE EXPERIMENTS
There were only two case studies under limited conditions in the pilot experiment. However, the fact that NNR-GL outperformed the conventional measures in both case studies indicates the potential of NNR-GL for real world problems. Furthermore, we found three perspectives in the pilot experiment and also in the evaluation experiment in Section V. The first perspective is inspired by the success of NNR-GL in detecting the common-sense set and the indicator variable sets. We come up with the following. By adaptively updating the groups in GL, NNR-GL will be able to gather detected dependent variables into a group as prior information. Under this prior information, which acts like common sense, NNR-GL will detect other latent dependences in the next turn. This spiral process will carve out important novel dependences.
For ease of understanding the experimental results, we manually organized the detected dependences into their sets in Table 6. The same will be needed when NNR-GL is applied to real world problems. Therefore, the second perspective is a framework that automatically organizes the sets of detected dependent variables. In addition, it will be helpful for human awareness to indicate which variable is the representative of each of these sets. Although this idea was partially achieved in our past study [34], it should be improved and evaluated in detail.
It was experimentally confirmed that NNR-GL works well for the first step, the detection of potential dependences in Fig. 1. To further ensure the fundamental effectiveness of NNR-GL, we are planning to conduct additional experiments using different types of dependences and noise distributions. To go beyond the first step, in other words, to bridge the first step to the second and third ones, the third perspective is a way to mathematically formulate the detected dependences. The inverse mapping problem in Section III-A gives us a hint that causalities, ''which variables cause which ones,'' can be detected by bidirectional regressions.
Our idea is to achieve this causality detection by assuming the following: as inputs, variables yielding a one-to-one or many-to-one mapping (namely, forward mapping) should be ''causes,'' while variables yielding a one-to-many or many-to-many mapping (inverse mapping) should be ''results.'' We actually introduced part of this idea to identify representative variables in the past study [34], but a thorough investigation of it should be part of our future work. Furthermore, we think that the number of turning points on a detected dependence helps to formulate the dependence function, because this number suggests the shape and degree of the function. Recently, we started working on the three perspectives above.

VII. CONCLUSION
The detection of variable dependences is essential for a broad range of disciplines. To detect nonlinear dependences among multiple variables, we proposed a measure called NNR-GL. It consists of nonlinear modeling by Neural Network Regression and variable selection by Group Lasso, accompanied by the quantification of detected dependences and averaging for robustness. To evaluate the fundamental effectiveness of NNR-GL, we conducted several experiments. NNR-GL was applied to artificial datasets with several dependences under different data sizes and noise rates, and the correctness of its detection was estimated. A criterion CDP-AUC, which is the overall representation of true positive and true negative detections, was used for the estimation. The values of CDP-AUC achieved by NNR-GL were 0.7472 to 0.9681. They were higher than the values of CDP-AUC of the conventional measures, which were 0.5972 to 0.9259, for all the experimental conditions. It was confirmed that NNR-GL can detect co-nonlinearities correctly and robustly with respect to data size and noise rate. Our future work will be the improvement and extension of NNR-GL and the confirmation of its practicality for real world problems.

PATRICK THEN (Member, IEEE) is a Professor and the Head of the School of Information and Communication Technologies in the Faculty of Engineering, Computing and Science, Swinburne University of Technology, Sarawak. He is also the Director of the Centre for Digital Futures, Swinburne University of Technology. He has established industry collaborations and has been managing and leading projects funded by industry and government agencies at the national and international levels. His research interests include big data, data mining, health informatics, and the Internet of Things. He is a fellow of the Society for Design and Process Science, USA, and a Senior Member of the Australian Computer Society.