Kernel Joint Non-Negative Matrix Factorization for Genomic Data

The multi-modal, or multi-view, integration of data has found wide applicability in pattern extraction, clustering, and data interpretation. Recently, variants of Non-negative Matrix Factorization (NMF), such as joint NMF (jNMF), have allowed the integration of data from different sources and have facilitated the incorporation of prior knowledge, such as the interactions between variables from those sources. However, in both NMF and jNMF, the factorization is carried out as a linear system, which cannot capture the non-linear patterns present in most real-world data. Therefore, we propose a new variant of jNMF called Kernel jNMF (KjNMF). This new method carries out the factorization of the original matrices in a high-dimensional feature space. Applying our method to synthetic data and biological cancer data, we found that it performed better in clustering and interpretation than the linear jNMF methods.


I. INTRODUCTION
Integration of data from different sources has become an area of intense research. In health applications, for example, high-throughput omics technologies provide a wealth of information about cells and organisms related to different types of molecular entities (e.g., DNAs, RNAs, proteins). The integration of this vast information for multiple individuals, such as that stored in The Cancer Genome Atlas (TCGA) [1] and The Cancer Cell Line Encyclopedia (CCLE) [2] projects, allows the identification of associations between different sources and of groups of related molecules across different layers of information. In addition, the use of machine learning methods to integrate these data opens a diverse field of possibilities to improve the discovery of patterns embedded in the original data [3], [4].
Among the strategies for multi-view integration of data, three main groups stand out, based on dimension reduction and ensemble methods: early, intermediate, and late integration. In early integration, the matrices are concatenated for feature selection or decomposed into principal components (PCs) to create new variables used as input to some machine learning model. In intermediate integration, the data are initially processed individually, for example, using kernel functions to extract non-linear patterns from the data, and these patterns are then used as input to an ensemble model. Finally, late integration consists of integrating ensemble models: individual models are generated for each input, and the results are then combined into a final model by voting or averaging strategies [5].
As in conventional machine learning, multi-view data integration methods can be classified as supervised or unsupervised. Among the unsupervised methods based on the Non-negative Matrix Factorization (NMF) technique, joint NMF (jNMF) is the benchmark for intermediate integration [6]. While in NMF a matrix $X$ is linearly factorized into two low-rank matrices (i.e., a basis matrix $W$ and a coefficient matrix $H$), in jNMF this process is done simultaneously for different input matrices $X_1, \dots, X_M$. The resulting factorized matrices correspond to a common basis matrix $W$ and as many coefficient matrices as there are input matrices. The advantage of jNMF over NMF is that it finds the common centroids for all samples in the basis matrix, whereas the cluster and co-cluster assignments in the coefficient matrices favor the interpretation of embedded patterns [7]. In addition, some extensions of jNMF may help the interpretation of clusters. This is the case of orthogonal integrative NMF (iONMF), which uses an orthogonality-regularized penalty to enforce non-overlapping features within clusters, providing interpretable models [8].
Despite the versatility and wide range of applications of these methods, both NMF and jNMF are linear models, where the observed data are decomposed into linear combinations of columns of $W$ and rows of $H$. However, in many cases the relations between basis vectors are expected to be non-linear given the nature of the data [9], [10]. A possible approach to find these relations is to map the observations ($x_i$) into a higher-dimensional space where more meaningful associations may be found. The idea is to use a mapping function $\phi(\cdot)$ into a high-dimensional space that is not explicitly known but is equipped with a scalar product expressed as a kernel function, $K(x_i, x_j) = \phi^T(x_i)\,\phi(x_j)$. In this new space, we can apply the NMF method and achieve a better pattern separation. This approach was used in NMF [11] by comparing two kernel-based NMF variants, the Polynomial NMF (PNMF) and the Gaussian NMF (GNMF), where GNMF achieved better classification accuracy than PNMF.
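To make the kernel idea concrete, the following minimal sketch computes a Gaussian kernel matrix over the columns of a data matrix without ever materializing $\phi(x)$; the function name and the bandwidth $\sigma$ are illustrative, not part of the original implementation.

```python
import numpy as np

def gaussian_kernel(X, sigma=1.0):
    """Gaussian (RBF) kernel matrix over the columns of X.

    K[i, j] = exp(-||x_i - x_j||^2 / (2 * sigma**2)), where x_i is the
    i-th column of X; the mapping phi is never computed explicitly.
    """
    # Squared Euclidean distances between all pairs of columns.
    sq_norms = np.sum(X**2, axis=0)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X.T @ X
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * sigma**2))
```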
This article proposes an extension of the jNMF framework for multi-view data integration based on kernels. This implementation, called Kernel joint NMF (KjNMF), was tested on simulated data and observational data from the TCGA project. In both types of data, we incorporated sparsity and prior-knowledge constraints to improve cluster identification. In addition, several methods of normalization of the $W$ and $H_I$ matrices were tested to improve clustering. We found that, compared with the standard jNMF and iONMF methods, KjNMF performs better by increasing cluster quality and, thus, leads to a more accurate interpretation and identification of real clusters.
In this research, we use the following notation:
1) Input non-negative matrices are denoted by bold capital letters, e.g., $\mathbf{X}$. As several matrices contain data on the same samples from different sources ($M$), the sub-index $I$ is added to indicate the particular source $\mathbf{X}_I$, where $I = 1, \dots, M$. Each matrix has dimensions of $n$ samples by $p_i$ variables.
2) Low-rank matrices are also denoted by bold capital letters ($\mathbf{A}$, $\mathbf{W}$, or $\mathbf{H}$), but their dimensions depend on the rank ($k$) that is determined beforehand.
3) In the NMF methods, $\mathbf{W}$ is the basis matrix and $\mathbf{H}$ is the coefficient matrix. In the convex-NMF methods, $\mathbf{W}$ is a linear combination of the columns of $\mathbf{X}$ ($x_i$) weighted by the entries ($a_{ji}$) of a matrix $\mathbf{A}$.
4) $\Theta_I$ and $R_{IJ}$ are sparse matrices that encode the associations between variables of the input matrices $\mathbf{X}_I$: $\Theta_I$ relates the variables within $\mathbf{X}_I$, whereas $R_{IJ}$ relates the variables of $\mathbf{X}_I$ and $\mathbf{X}_J$.
5) A kernel is a function $K$ such that, for all pairs of points $x_i$ and $x_j$, $K(x_i, x_j) = \phi^T(x_i)\,\phi(x_j)$, where $\phi$ is a mapping function from $X$ to a feature space $F$, i.e., $\phi : x \longmapsto \phi(x) \in F$.
6) The $\phi$ function maps a column of $\mathbf{X}_I$ to the feature space, so the mapping of the complete matrix is denoted by $\Phi(\mathbf{X}_I)$.
7) Hyperparameters are denoted by lowercase Greek letters ($\lambda$, $\gamma$, and $\omega$).
8) $\rho$ corresponds to the cophenetic coefficient and AUC to the Area Under the Curve.

II. RELATED WORK
NMF has proved useful in a wide range of applications in pattern identification and classification. However, there are still very complex real-world data structures that may need methods beyond the standard linear Euclidean formulation to be discovered. The use of non-linear approaches can increase the performance of NMF [10]. The natural alternative for including non-linearity in the factorization is to map the observations to a higher-dimensional space, or ''feature space,'' where the clusters or patterns are better defined and, therefore, their separation becomes feasible. The common way is to formulate the NMF optimization problem in terms of the inner product $X^TX$, so that kernel functions, such as the Gaussian or polynomial kernels, can perform the mapping. A convex-NMF has been proposed in [9], which postulates $W$ as a weighted sum of the columns of $X$, i.e., $\mathbf{w}_i = \sum_{j=1}^{m} a_{ji}\,\mathbf{x}_j$. Because of this, convex-NMF is established in the form $\|X - XAH\|^2$, and the update rules depend on $X^TX$.
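The reason the update rules depend on the data only through $X^TX$ is a short expansion of the Frobenius norm:

$$\|X - XAH\|_F^2 = \mathrm{Tr}\!\left(X^T X - 2\,X^T X A H + H^T A^T X^T X A H\right),$$

so any kernel matrix $K = \Phi^T(X)\,\Phi(X)$ can be substituted for $X^TX$ without changing the structure of the problem.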
Several attempts to kernelize NMF have generated variants that have shown high efficiency in clustering and pattern recognition in recent years. A method using the semi-NMF strategy was proposed to incorporate the kernel function [12]; it showed better discrimination of the captured information than the standard NMF method. Other approaches, such as online Kernel NMF (OKNMF), are based on the convex-NMF method with a set of constraints to determine the number of basis vectors in the feature space [13]. In general, this method performed well and was computationally efficient. A flexible kernel NMF (KNMF) method has also been proposed [11], which presents a general formulation of the Gaussian kernel function implementation in NMF.
Among the methods for data integration, some use variants of the NMF method. The jNMF method, for instance, has been proposed to standardize the use of NMF on multi-view data [6]. This procedure adapts the two low-rank non-negative matrices obtained in NMF, the basis matrix $W$ and the coefficient matrix $H$, into a common basis matrix $W$ for the multiple views while maintaining a coefficient matrix $H_I$ for each view. The goal is to detect a common pattern for all individuals while maintaining the specific patterns of each view. In this framework, the use of constraints in the form of prior knowledge can increase clustering performance by relating features through the within-relationship matrix $\Theta_I$ and the between-relationship matrix $R_{IJ}$ [6]. The $\Theta_I$ constraint indicates whether a relationship exists between feature $i$ and feature $j$ in the same view matrix; the $R_{IJ}$ constraint indicates whether a relationship exists between feature $i$ in matrix $I$ and feature $j$ in matrix $J$. This prior information is included in the framework as constraints in the loss function.
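As a concrete illustration, binary constraint matrices of this kind might be assembled from a list of known interactions roughly as follows; the function and the pair-list format are hypothetical, not taken from the original implementation.

```python
import numpy as np

def build_relation_matrix(vars_i, vars_j, known_pairs):
    """Binary between-relationship matrix R_IJ.

    R[a, b] = 1 if (vars_i[a], vars_j[b]) is a known interaction,
    e.g., a gene-miRNA pair taken from a public database.
    known_pairs maps a variable of view I to its partners in view J.
    """
    index_j = {v: b for b, v in enumerate(vars_j)}
    R = np.zeros((len(vars_i), len(vars_j)))
    for a, v in enumerate(vars_i):
        for partner in known_pairs.get(v, ()):
            if partner in index_j:
                R[a, index_j[partner]] = 1.0
    return R
```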
In addition, given that these methods may present a non-unique solution, it has been recommended to normalize the matrices obtained during the convergence process. As stated in [14], normalization of the matrices can affect the efficiency of the NMF cluster-allocation process: it alleviates the uncertainties of pattern detection by reducing variations in the data and the dependence on initial conditions.

A. JOINT NON-NEGATIVE MATRIX FACTORIZATION
The jNMF method integrates data consisting of $n$ individuals from whom $p_i$ variables have been measured in each of the different sources, forming a matrix $X_I$ for $I = 1, \dots, M$. The method extends the concept of single-matrix factorization ($X$) to multiple types of matrices ($X_I$) [6]. Therefore, jNMF factorizes, as stated in problem (1), the input matrices $X_I \in \mathbb{R}^{n \times p_i}$ into low-rank matrices: a common basis matrix $W \in \mathbb{R}^{n \times k}$ and coefficient matrices $H_I \in \mathbb{R}^{k \times p_i}$ for each available data source, where the factorization rank $k$ is usually much smaller than each $p_i$. The matrices $H_I$ and $W$ correspond to the solution of minimization problem (1) with respect to $W$ and each $H_I$. Furthermore, prior knowledge can be incorporated into the jNMF problem as constraints by using the matrices $R_{IJ} \in \mathbb{R}^{p_i \times p_j}$ and $\Theta_I \in \mathbb{R}^{p_i \times p_i}$, which contain the relations between and within the variables of the different data sources. These constraints help reduce the instability of the estimation due to the non-uniqueness of the NMF solution and lead to meaningful clusters. In the case of omic data, for instance, $R_{IJ}$ can express whether a specific gene from $X_I$ is related to a specific miRNA from $X_J$, i.e., variables from different sources are being related. Similarly, the $\Theta_I$ matrix contains information about possible relations between variables of the same data source, for example, protein-protein interactions or gene interactions in metabolic pathways. The superscript $(t)$, i.e., $\Theta_I^{(t)}$, can be used when several types of restrictions apply to the same matrix $X_I$. In both types of restrictions, the matrices are binary coded, taking a value of 1 when there is a relationship and 0 otherwise. These matrices are included in the jNMF optimization problem as constraints and regularization terms to favor the correct clustering. The resulting estimation problem is (2).
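Written out, the basic minimization problem (1) described above is

$$\min_{W \ge 0,\; H_I \ge 0} \;\; \sum_{I=1}^{M} \left\| X_I - W H_I \right\|_F^2,$$

and the regularized problem (2) follows the same pattern, with penalty terms built from $\Theta_I$ and $R_{IJ}$ added to the loss.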
Another variant of jNMF incorporates orthogonality constraints, which force the basis vectors to be orthogonal and prevent the overlapping of features within the clusters [8]. These constraints generate more interpretable models by better discriminating the features, where the orthogonality of the columns of $H_I$ is controlled by the hyperparameter $\alpha$. The objective function (3) is minimized to obtain the $W$ and $H_I$ matrices.
III. KERNEL JOINT NON-NEGATIVE MATRIX FACTORIZATION
Given the non-linear structure of real-world data, kernel-based methods make use of a function $\phi$ that maps the original observations (columns) of $X = [x_1, \dots, x_p]$ to a higher-dimensional space, or feature space ($F$), to obtain the mapping $\Phi(X) = [\phi(x_1), \dots, \phi(x_p)]$. The goal of this mapping is to linearize non-linear relationships in the feature space. The feature space is a vector space of finite or infinite dimension that must be separable and complete, and its inner product exists for any pair of points $\phi(x_i)$ and $\phi(x_j)$. When the mapping function $\phi$ is applied to $X$, the coordinates of the observations in this feature space are in general unknown. However, it is possible to calculate their pairwise inner products, a metric that defines the distance between pairs of points. In other words, this process defines a distance between observations in $F$ under which relationships that can be represented linearly are sought. If the optimization problem is defined using the inner product of the observations in the feature space, then a kernel function is easily applicable without the need to specify the $\phi$ function.
To modify the objective function of (2) in terms of the inner product of the input matrices $X_I$, we considered the approach based on the convex-NMF method [9], which proposes that the centroids obtained in $W$ are a convex combination of the observations (columns), i.e., $W = XA$. This strategy has several advantages: (i) it is possible to impose non-negativity and sparsity constraints on $A_I$; (ii) it generates a sparse and interpretable solution, since it allows identifying how the clusters are composed; (iii) the matrix $X$ can have negative values; and (iv) the objective function remains in terms of the kernel function [9], [15]. In our case, we formulated the basis matrix $W$ as $\Phi(X_I)A_I$. It is important to note that in the original jNMF problem there is a single $W$ that integrates the samples based on the information in each input matrix. This can be understood as follows: given the information provided by the input matrices, the low-rank matrix $W$ contains the centroids of the samples that gather all this information. For this reason, we introduced a fourth term that allows us to approximate a single $W^*$ by making all the products $\Phi(X_I)A_I$ similar to each other. This approach was used in the jNMF implementation of [16], where separate matrices $W_I$ are estimated and a parameter $\lambda$ forces a degree of similarity between them. Therefore, we proposed the fourth term in (4), with the parameter $\omega$ controlling this matrix similarity. To control the sparsity and scale of the low-rank matrices [6], we introduced two additional terms: a fifth term to control the scale of $A_I$ and a sixth term to induce sparsity in the matrices $H_I$. These terms control cluster formation and pattern detection. The parameter $\gamma_1$ controls the growth of $A_I$ and $\gamma_2$ controls the required sparsity, as shown in (4).
We show a graphical representation of the KjNMF method in Fig. 1 and the formulation of the objective function in (4).
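For orientation, writing $\Phi_I \equiv \Phi(X_I)$, the term-by-term description above suggests an objective of roughly the following shape. This is a sketch only: the signs and exact norms of the penalty terms follow the conventions of [6], and the published equation (4) should be taken as authoritative.

$$\begin{aligned} \min_{A_I \ge 0,\, H_I \ge 0} \; &\sum_{I=1}^{M} \left\| \Phi_I - \Phi_I A_I H_I \right\|_F^2 \;-\; \lambda_1 \sum_{I} \mathrm{Tr}\!\left(H_I \Theta_I H_I^{T}\right) \;-\; \lambda_2 \sum_{I < J} \mathrm{Tr}\!\left(H_I R_{IJ} H_J^{T}\right) \\ &+\; \omega \sum_{I < J} \left\| \Phi_I A_I - \Phi_J A_J \right\|_F^2 \;+\; \gamma_1 \sum_{I} \left\| A_I \right\|_F^2 \;+\; \gamma_2 \sum_{I} \left\| H_I \right\|_1 . \end{aligned}$$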

IV. FACTORIZATION ALGORITHM
The standard method to solve the optimization problem in jNMF and NMF methods is the multiplicative update rule (MUR). Although diverse variants have been generated to reduce computational resources, MUR remains a competitive algorithm with reasonable performance [6], [17]. Our algorithm works in a similar way to the multiplicative rule of jNMF or NMF. Thus, since the optimization problem is not convex, we initially fix the matrix $A_I$ and update the matrix $H_I$, and then we fix the latter ($H_I$) and update the former ($A_I$). To find the multiplicative rules in our KjNMF context, we differentiated the objective function defined in (4) and used an approach based on [11] to generate the MUR in this kernel problem. As mentioned, the basis matrix in KjNMF is a linear combination of the mapped data using the $\phi$ function, so our objective function can be expressed as (5), whose reconstruction term expands to

$$\mathrm{Tr}\!\left(K_I - K_I A_I H_I - H_I^{T} A_I^{T} K_I + H_I^{T} A_I^{T} K_I A_I H_I\right),$$
where $K_I = \Phi(X_I)^T \Phi(X_I)$ is the kernel matrix within the data of matrix $X_I$, and $K_{IJ} = \Phi(X_I)^T \Phi(X_J)$ is the cross-kernel matrix between $X_I$ and $X_J$. It is important to mention that the kernels are computed on the columns and, therefore, a kernel of the type $K_I$ has dimensions $p_I \times p_I$. The Lagrange function $L$ associated with the minimization problem in (4) includes the multipliers $\phi^I_{ij}$ and $\psi^I_{ij}$ of the non-negativity constraints on $A_I$ and $H_I$. The partial derivatives of $L$ with respect to $A_I$ and $H_I$ are obtained accordingly. From the Karush-Kuhn-Tucker conditions $\phi^I_{ij}(A_I)_{ij} = 0$ and $\psi^I_{ij}(H_I)_{ij} = 0$, we calculated the update rules (8) and (9) for $(A_I)_{ij}$ and $(H_I)_{ij}$.
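The full rules (8) and (9) carry the regularization terms; restricted to the reconstruction part shown above, the KKT conditions yield multiplicative updates of the standard convex-NMF form. The following is a sketch, valid when $K_I$ is elementwise non-negative (as for the Gaussian kernel):

$$(A_I)_{ij} \leftarrow (A_I)_{ij}\,\frac{\left(K_I H_I^{T}\right)_{ij}}{\left(K_I A_I H_I H_I^{T}\right)_{ij}}, \qquad (H_I)_{ij} \leftarrow (H_I)_{ij}\,\frac{\left(A_I^{T} K_I\right)_{ij}}{\left(A_I^{T} K_I A_I H_I\right)_{ij}}.$$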
For a complete convergence analysis of the multiplicative algorithm check Supplementary Section S1.
Although NMF problems are NP-hard, so that finding a global optimum is not easy, it is feasible to identify a local optimum [18]. Therefore, we define a stopping criterion that evaluates the relative difference between two consecutive iterations of the objective function ($F$) assessed at iterations $t$ and $t+1$. The algorithm stops when it reaches a threshold $\tau$, i.e., $(F_t - F_{t+1})/(F_0 - F_{t+1}) \le \tau$; in our case, the stopping threshold was set to $10^{-6}$.
The multiplicative rules were implemented in an iterative process as described in Algorithm 1. Additionally, the $\Theta_I$ and $R_{IJ}$ constraints are added, and the joint factorization is performed in the feature space. The resulting low-rank matrices are $A_I$ and $H_I$ for each input matrix. The products $\Phi(X_I)A_I$ are penalized for dissimilarity among the different inputs, forcing them toward a common basis.
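A minimal Python sketch of this loop, keeping only the reconstruction term of the objective and the stopping criterion of Section IV (the prior-knowledge, similarity, and sparsity terms of the full Algorithm 1 are omitted, and the function name is illustrative), could look as follows:

```python
import numpy as np

def kernel_jnmf(kernels, k, tau=1e-6, max_iter=2000, eps=1e-10, seed=0):
    """Simplified multiplicative updates for kernel joint NMF.

    kernels: list of M kernel matrices K_I (p_I x p_I), elementwise
    non-negative (e.g., Gaussian). Returns lists of A_I (p_I x k) and
    H_I (k x p_I). The Theta/R, omega, and sparsity terms of the full
    objective (4) are omitted in this sketch.
    """
    rng = np.random.default_rng(seed)
    A = [rng.random((K.shape[0], k)) for K in kernels]
    H = [rng.random((k, K.shape[0])) for K in kernels]

    def loss():
        return sum(np.trace(K - K @ a @ h - h.T @ a.T @ K
                            + h.T @ a.T @ K @ a @ h)
                   for K, a, h in zip(kernels, A, H))

    f0 = f_prev = loss()
    for _ in range(max_iter):
        for K, a, h in zip(kernels, A, H):
            # KKT-derived updates for the reconstruction term.
            h *= (a.T @ K) / (a.T @ K @ a @ h + eps)
            a *= (K @ h.T) / (K @ a @ h @ h.T + eps)
        f = loss()
        # Relative-improvement stopping criterion from Section IV.
        if (f_prev - f) / max(f0 - f, eps) <= tau:
            break
        f_prev = f
    return A, H
```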

V. HYPERPARAMETER SELECTION
We selected the best set of hyperparameters ($\lambda_1$, $\lambda_2$, $\gamma_1$, $\gamma_2$, $\omega$, and $k$) according to metrics that evaluate the performance of the jNMF, iONMF, and KjNMF methods. We used these metrics to measure the stability of the clusters while the algorithm runs (Section V-A). We also used them to evaluate the clusters created on a test dataset (Section V-B). In addition, for biological data, it is expected that the clusters will have an interpretation consistent with the disease or biological process under study (Section V-C). All these metrics lie between 0 and 1, with 1 being the best expected result. For the iONMF method, we evaluated the rank ($k$) and the orthogonality hyperparameter ($\alpha$), using the same set of metrics to select the best set of hyperparameters.

A. CLUSTER STABILITY
The cluster stability measure depends on the connectivity and consensus matrices. There are as many connectivity matrices as there are types of data sources $M$ in the model. For each type of data source, the connectivity matrix $C_I$ is an $n_I \times n_I$ square matrix that indicates whether a pair of samples $p_1$ and $p_2$ of the same data source $I$ belongs to the same cluster, i.e., $C_I[p_1][p_2] = 1$, and 0 otherwise [19]. Then, there is the consensus matrix $\hat{C} \in [0,1]^{n_I \times n_I}$, which measures the proportion of times that samples $p_1$ and $p_2$ are allocated to the same cluster. The closer $\hat{C}[p_1][p_2]$ is to 0 or to 1, the more stable the allocation process is for these two samples. In addition, we used the consensus matrix to calculate the cophenetic coefficient ($\rho$) following [20]. The larger $\rho$, the better, because the structure of the clusters across iterations remains similar [20].
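A sketch of how $\rho$ could be computed from a consensus matrix with SciPy, treating $1 - \hat{C}$ as a distance in the usual NMF consensus-clustering fashion of [20], is shown below; the function name is illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import squareform

def cophenetic_coefficient(consensus):
    """Cophenetic coefficient of a consensus matrix.

    consensus[i, j] is the fraction of runs in which samples i and j
    fell in the same cluster; 1 - consensus then acts as a distance.
    """
    dist = squareform(1.0 - consensus, checks=False)  # condensed form
    Z = linkage(dist, method="average")
    rho, _ = cophenet(Z, dist)
    return rho
```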

B. AUC ON TESTING DATA
We calculated the Area Under the Curve (AUC) to investigate the accuracy of identifying embedded patterns by comparing training and testing coefficient matrices. We used an approach similar to the one stated in [6], i.e., extending the jNMF method to a predictive form. In this procedure, we split the data into train and test datasets. First, we trained the model to find the optimal basis and coefficient matrices ($(\Phi(X_I)A_I)^{\text{Train}}$ and $H_I^{\text{Train}}$). Second, we fixed the basis matrix $(\Phi(X_I)A_I)^{\text{Train}}$ obtained from the training set and updated the coefficient matrices of the testing set ($H_I^{\text{Test}}$) until reaching the threshold $\tau$ (Section IV). Third, we ensured a correct matching of the rows of $H_I^{\text{Train}}$ and $H_I^{\text{Test}}$ by using the strategy proposed by [6], i.e., the Hungarian algorithm, which selects which row of $H_I^{\text{Test}}$ corresponds to each row of $H_I^{\text{Train}}$. Finally, we calculated the AUC for each data source. In this way, we expected to find the same compartmentalization of the variables independently of the origin of the samples.
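A sketch of the matching-plus-scoring step is given below; SciPy's linear_sum_assignment implements the Hungarian algorithm. Treating the binarized training memberships as labels and the matched test coefficients as scores is our reading of the procedure, so those details are an assumption.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import roc_auc_score

def matched_auc(H_train, H_test):
    """Match rows of H_test to rows of H_train, then score agreement.

    Both matrices are k x p over the same variables. The Hungarian
    algorithm maximizes the total row-to-row correlation; the train
    memberships (max per column) become labels, and the matched test
    coefficients become scores for the AUC.
    """
    k = H_train.shape[0]
    corr = np.corrcoef(H_train, H_test)[:k, k:]
    _, col_ind = linear_sum_assignment(-corr)   # maximize correlation
    H_test_matched = H_test[col_ind]
    labels = (H_train == H_train.max(axis=0)).ravel().astype(int)
    return roc_auc_score(labels, H_test_matched.ravel())
```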

C. BIOLOGICAL RELEVANCE MEASURE
We performed a gene enrichment analysis using the clusterProfiler v3.14.3 package [21]. This analysis was done for each cluster to determine associations with ontological groups and with metabolic and signaling pathways. In addition, we used OmicsNet [22] and Ingenuity Pathway Analysis [23] to interpret clusters of different molecules (co-clusters).
We used this information to calculate the number of molecules used on average across all omics, the number of genes included in an enriched term on average across all terms (GeneRatio average), and the ratio of enriched terms detected in the clusters. As a final metric, we calculated the average of $\rho$, AUC, GeneRatio average, and the ratio of captured terms as a criterion for selecting suitable hyperparameter sets. We call this metric the ''performance score,'' which is acceptable when near one.

VI. CLUSTER MEMBERSHIP ASSIGNMENT
As mentioned in [9] and [16], the values in the rows of the coefficient matrices ($H_I \in \mathbb{R}^{k \times p_i}$) can be interpreted as degrees of belonging to the respective cluster. In NMF, the cluster membership assignment for a variable $v$ is as simple as finding the row with the maximum value in the $v$-th column of the matrix $H_I$. Nevertheless, this assignment rule can place variables with very different degrees of belonging in the same cluster $k$, so we decided to implement a level of certainty for the membership assignment by comparing the degrees of the variables assigned to each cluster.
With that in mind, the implemented process defines whether the $p$-th variable belongs to the $k$-th cluster. In the first step, we assigned the clusters as usual, finding the row with the maximum value of each column in the matrix $H_I$ (green-colored cells in Fig. 2). In the second step, we checked whether the value in $H_I$ of each variable already assigned to a cluster $k$ is at least greater than the third quartile of the degrees of belonging to cluster $k$ of all the assigned variables. Fig. 2 exemplifies these assignments. Briefly, in this example, three variables ($v_1$, $v_2$, and $v_4$) were assigned to cluster $k_3$ in the first assignment. We calculated the third quartile of this vector [0.57, 0.96, 0.92], which was 0.94. Then, we deallocated the variable $v_1$, since it has a lower degree of belonging to cluster $k_3$ than the other variables assigned to that cluster. In the second assignment, we applied the same strategy to the second-maximum degrees of each column, but taking values greater than the third quartile plus 1.5 times the interquartile range. We took this decision because a biomolecule can participate in several biological processes, so it would be very restrictive not to allow this inclusion. It is important to note that some variables can end up deallocated under this rule for the membership assignment process.

FIGURE 2. Assignment rule for variables to each cluster. We made two assignments using the maximum and second maximum per column, deciding whether a variable belongs to a cluster when these values exceed a threshold defined by the quartiles (Q). The first assignment used Q3; the second assignment used Q3 + 1.5 * IQR. The final clusters contain the variables obtained from the two assignments.
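A sketch of this two-step rule, following the Q3 convention of the Fig. 2 caption (the function name and tie-breaking details are our own), is:

```python
import numpy as np

def assign_clusters(H):
    """Two-step cluster assignment for the columns (variables) of H.

    Step 1: assign each variable to the row (cluster) holding its
    maximum coefficient, then drop assignments below the third
    quartile (Q3) of that cluster's assigned values. Step 2: allow a
    second membership when the second-largest coefficient exceeds
    Q3 + 1.5 * IQR of the cluster's second-maximum values.
    """
    k, _ = H.shape
    clusters = {c: [] for c in range(k)}
    first = H.argmax(axis=0)                     # first assignment
    for c in range(k):
        members = np.where(first == c)[0]
        if members.size:
            q3 = np.percentile(H[c, members], 75)
            clusters[c] = [v for v in members if H[c, v] >= q3]
    second = np.argsort(H, axis=0)[-2]           # row of the 2nd max
    for c in range(k):
        members = np.where(second == c)[0]
        if members.size:
            q3, q1 = np.percentile(H[c, members], [75, 25])
            cutoff = q3 + 1.5 * (q3 - q1)
            clusters[c] += [v for v in members if H[c, v] > cutoff]
    return clusters
```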

VII. NORMALIZATION OF A I AND H I
The non-uniqueness problem of the resulting matrices in NMF methods leads to misinterpretations of the clusters or patterns. This situation may be due to noise in the data or to the initial conditions of the algorithm. Several strategies for normalizing $W$ or $H_I$ during the computation of the multiplicative rules have been proposed. By exploring how the normalization of the low-rank matrices affects NMF performance, it has been found that the choice of normalization can regulate the non-uniqueness of NMF, increasing clustering performance [14]. The normalization process consists of applying mapping functions $f(\cdot)$ to the columns or rows of the low-rank matrices to obtain diagonal matrices that rescale each value in the low-rank matrix. Using this approach, it was found that the maximum norm ($\|\cdot\|_\infty$) performed better than traditional normalization methods [14].
We implemented different normalization mapping functions to analyze the variation in the clustering allocation process due to the normalization method. For example, in [14], the normalization process divides each column of $W$ by the value of a mapping function $f(\cdot)$ applied to that column; in the case of KjNMF, this process is applied to $A_I$. According to [14], every normalization method should be applied as $W \leftarrow WD^{-1}$ and $H \leftarrow DH$, where $D$ is a diagonal matrix of dimension $k$ (the rank). The $k$-th value of the diagonal of $D$ is the result of the mapping function (e.g., $\|\cdot\|_\infty$) applied to the $k$-th column of $W$ ($\mathbf{w}_k$), i.e., $D = \mathrm{diag}\left(f(\mathbf{w}_1), \dots, f(\mathbf{w}_k)\right)$. There are different ways to build this $D$ matrix. In [14], for instance, the columns of the $W$ matrix are mapped through various norms to obtain the values of $D$. The product $WD^{-1}$ means that each value of the diagonal of $D$ divides the corresponding column of $W$, while in the product $DH$ it multiplies the corresponding row of $H$. We investigated alternative normalization cases based on the original jNMF of [6] and the research of [14]; a code sketch follows this list. 1) Adapted Zhang method [6]: all the $H_I$ matrices are concatenated along the common rows ($k$) and the mapping function $f(\cdot)$ is applied to the rows of the concatenated matrix to build $D$.
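A generic sketch of this $D$-matrix rescaling, preserving the product of the factors and defaulting to the maximum norm of [14] (the function name is illustrative), is:

```python
import numpy as np

def normalize_factors(W, H, f=lambda col: np.max(np.abs(col))):
    """Rescale W and H without changing their product.

    D = diag(f(w_1), ..., f(w_k)); returns W D^{-1} and D H, so that
    (W D^{-1})(D H) = W H. The default f is the maximum norm.
    """
    d = np.array([f(W[:, j]) for j in range(W.shape[1])])
    d[d == 0] = 1.0  # guard against all-zero columns
    return W / d, H * d[:, None]
```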

VIII. EXPERIMENTS
We presented two evaluation scenarios for the proposed algorithm (KjNMF) and the linear methods (jNMF and iONMF). The first case used synthetic data defined by isotropic Gaussian distributions, which can be pictured as spheres nested inside one another. The second case is related to the bioinformatics field, where we tested the implementation on cancer omic data, since these data contain non-linear structure and many patients with different characteristics. In addition, using biological databases to create prior-knowledge constraints increased the potential to find biomarkers or biological pathways associated with a particular disease. Fig. 3 shows a graphical representation of these relations for both the synthetic data and the TCGA datasets.
In each case, we ran the algorithm with 3-fold cross-validation. We calculated and averaged the $\rho$ metric and the AUC on testing data for each repetition. Numeric results for the synthetic and biological datasets are in Supplementary Files 1 and 2, respectively. We estimated the time complexity of one iteration round of the MUR method for our method: we increased the number of variables and samples from 200 to 5000 and then fit a linear regression to identify the degree of the relationship between dimension and execution time, finding it to be linear. For synthetic data, the average time for one iteration round was 0.27 seconds for all the normalization rules except rule 4, whose average time was 29 seconds. We used a 2.20 GHz Intel Core i7 processor with 16 GB of RAM.

A. SYNTHETIC DATA
We created three data sources with an equal number of samples $n$ and a different number of variables $p$ (200, 195, and 375). We employed the package scikit-learn v0.23.2 to generate mixtures of isotropic Gaussian datasets, which created a classification dataset of nested concentric multi-dimensional spheres. For each $X_I$ matrix, we generated two isotropic Gaussian datasets: the first with $\mu = 2$ and $\sigma^2 = 2$ and the other with $\mu = 4$ and $\sigma^2 = 1$. In both cases, we defined two classes, i.e., we assigned each observation to a cluster (Supplementary Fig. F1). We created the $R_{IJ}$ and $\Theta_I$ constraints using random sparse matrices of 0s and 1s.
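One view of such data could be generated with scikit-learn's isotropic Gaussian blob generator roughly as follows; the exact calls and the non-negativity handling are illustrative assumptions, not the paper's script.

```python
import numpy as np
from sklearn.datasets import make_blobs

def synthetic_view(n_samples, n_vars, seed=0):
    """One synthetic view: two nested isotropic Gaussian clusters."""
    inner, _ = make_blobs(n_samples=n_samples, n_features=n_vars,
                          centers=[[2.0] * n_vars],
                          cluster_std=np.sqrt(2.0), random_state=seed)
    outer, _ = make_blobs(n_samples=n_samples, n_features=n_vars,
                          centers=[[4.0] * n_vars],
                          cluster_std=1.0, random_state=seed + 1)
    X = np.vstack([inner, outer])                 # samples x variables
    y = np.repeat([0, 1], n_samples)              # cluster labels
    return np.abs(X), y   # abs() is one simple way to keep X non-negative
```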

1) KjNMF WITH DIFFERENT KERNEL FUNCTIONS
The usefulness of kernel-based methods is that a wide range of kernel functions can be used. We compared the KjNMF method with three kernel functions: the Gaussian, ANOVA, and polynomial (degree = 2) kernels. We observed that the $\rho$ and AUC metrics remained similar between these functions (Supplementary Fig. F2). These results are consistent with the mathematical definitions, at least between the Gaussian and ANOVA kernels, which both use a radial basis function. The polynomial kernel of degree 2 was in some cases below the other two kernels. This shows that the method allows the incorporation of other kernel functions specific to different tasks; e.g., the ANOVA kernel is helpful for multi-dimensional data. In the following, we employ the Gaussian kernel because of its usefulness in multi-dimensional data and because, unlike a polynomial kernel, no prior knowledge of the data is required [9], [10].
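For completeness, sketches of the two alternative kernels, computed over columns as before, are shown below; the width $\sigma$, offset $c$, and degree are illustrative parameters.

```python
import numpy as np

def polynomial_kernel(X, degree=2, c=1.0):
    """Polynomial kernel over columns: K[i, j] = (x_i . x_j + c)**degree."""
    return (X.T @ X + c) ** degree

def anova_kernel(X, sigma=1.0, degree=2):
    """ANOVA RBF kernel over columns:
    K[i, j] = sum_f exp(-sigma * (X[f, i] - X[f, j])**2) ** degree."""
    diff = X[:, :, None] - X[:, None, :]   # features x p x p differences
    return np.sum(np.exp(-sigma * diff**2) ** degree, axis=0)
```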

2) KjNMF EVALUATION AND COMPARISON WITH LINEAR jNMF METHODS
The hyperparameters were evaluated on a random grid where $\lambda_1$, $\lambda_2$, $\gamma_1$, $\gamma_2$, and $\omega$ were defined in the range $[10^{-8}, 1]$ for normalization methods 1, 2, and 3, and in the range $[10^{-10}, 10^{-7}]$ for normalization methods 4 and 5 (Section VII). We selected these hyperparameter values for normalization methods 4 and 5 because they use the $\ell_2$ and $\ell_1$ norms: penalizing with large values generates low-rank matrices with values at or near zero, which leads to numerical problems. We tested a total of 300 scenarios for all the normalization methods (one way to draw such a random grid is sketched below). We assessed the rank ($k$) between 5 and 100. For KjNMF, the kernel hyperparameter $\sigma$ was evaluated between 1 and 5. Fig. 4 shows the behavior of the AUC and $\rho$ for the different normalization methods in iONMF, jNMF, and KjNMF. We found similar behavior among all types of normalization for these methods. In both metrics, KjNMF performed better than the linear methods: the AUC was above 0.9 for KjNMF in most cases, whereas it did not exceed 0.6 for jNMF and iONMF (Supplementary Fig. F3). In all three algorithms, the results of $\rho$ were very close to each other (ranging between 0.8 and 1.0). In addition, with the fifth normalization method, we observed that the behavior of $\rho$ is very stable in all the algorithms, although KjNMF always presents better performance than the jNMF and iONMF methods. These results show that the clusters are stable ($\rho$ with high values), i.e., the clusters are similar in each run. However, in the case of KjNMF the clusters are also similar to each other (AUC $\approx$ 1) independently of the sample type (train and test), whereas in jNMF or iONMF (AUC $\approx$ 0.6) the clusters formed differ for each sample type.
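The paper does not state the sampling distribution of the random grid; a log-uniform draw is a natural assumption and could look like this (function name and the method-1-to-3 ranges shown):

```python
import numpy as np
from scipy.stats import loguniform

def sample_hyperparameters(n_scenarios, low=1e-8, high=1.0, seed=0):
    """Random hyperparameter grid (log-uniform over [low, high])."""
    rng = np.random.default_rng(seed)
    names = ["lambda1", "lambda2", "gamma1", "gamma2", "omega"]
    grid = []
    for _ in range(n_scenarios):
        scenario = {n: loguniform.rvs(low, high, random_state=rng)
                    for n in names}
        scenario["k"] = int(rng.integers(5, 101))      # rank
        scenario["sigma"] = float(rng.uniform(1, 5))   # Gaussian width
        grid.append(scenario)
    return grid
```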

B. BIOLOGICAL DATA
We used datasets from TCGA that contain measurements of gene expression at the level of both transcription (mRNA) and translation (protein), RNA-based gene regulation (miRNA), and genetic structural variation (CNV) from cancer patients. Since the number of genes was very high ($\sim$20K), we selected the genes associated with each specific cancer using the MalaCards database (Supplementary Section S2). In addition, we obtained prior knowledge of biological molecule associations or interactions to create the $R_{IJ}$ and $\Theta_I$ constraints by using publicly available biological databases (the composition of the prior-knowledge constraints is in Supplementary Section S3). The databases and packages used were:
• BioGRID v3.5 [24]: gene-gene associations.
• STRINGdb v9.1 [25]: protein-protein associations.
• KEGG graphite v1.32.0 [26]: gene-gene associations in metabolic pathways.
We compared the performance of the jNMF, iONMF, and KjNMF algorithms across three TCGA cancer types: BReast CAncer (BRCA), Low-Grade Glioma (LGG), and LUng ADenocarcinoma (LUAD). We obtained the data using the TCGA-Assembler tool [30]. These cancer types were selected not only because of the dimensionality of the data and the heterogeneity of the diseases but also because of their epidemiology: breast and lung cancer are the predominant neoplasms in adults, whereas glioma is the most common solid tumor in children. In addition, we evaluated LGG as a specific case study for clustering interpretation.
1) KjNMF EVALUATION AND COMPARISON WITH LINEAR METHODS
Considering that different sets of hyperparameters can be evaluated for all the algorithms, we found that the kernel implementation is superior to the linear methods. We therefore evaluated the normalization methods to identify the rank ($k$) with good stability performance ($\rho \approx 1$ and AUC $\approx 1$) and biological interpretation (refer to Table 1). Normalization method 4 for KjNMF presented good scores on these metrics for the three types of cancer evaluated. The $\rho$ values (greater than 0.90) evidenced cluster stability, as this value averages the cluster structure across MUR repetitions. Concerning the AUC, a value close to 1 means that the train $H_I$ matrix is very similar to the test $H_I$, so the assignment of variables to the different clusters is similar. This result can be understood because the samples come from a common type of cancer, and certain molecules maintain their relationships independently of the sample's origin. Finally, the enrichment ratio (the number of enriched clusters over the rank) was above 0.60 in the majority of cases (refer to Table 1 and Supplementary Figs. F4 and F5). The other four normalization methods had a good performance, but they were not as consistent as method 4.

TABLE 1. Metric values for different types of cancer data using the five normalization methods for the factorization algorithms KjNMF and jNMF. The biological terms detected is the ratio between the number of enrichment terms detected and the total enrichment terms.

2) EFFECT OF PRIOR KNOWLEDGE AND SIMILARITY CONSTRAINTS IN KjNMF
We evaluated whether the prior-knowledge constraints have any effect on cluster formation. We found that the GeneRatio average in KjNMF was 0.45, whereas in jNMF it was below 0.25 (p-value < 0.001, unpaired two-sample Student's t-test). This result suggests that the jNMF method is more unstable due to the lack of terms that allow the incorporation of prior knowledge (Supplementary Fig. F6). On the other hand, we tested the omission of the hyperparameter $\omega$, which controls the similarity between the products $\Phi(X_I)A_I$. We found a negative effect on the AUC, which dropped to less than 0.6 (p-value < 0.001, unpaired two-sample Student's t-test). These results indicate that, without this term, the KjNMF method can still cluster the features using the non-linear data but loses the ability to identify clusters predictively in a test dataset. For this reason, it is recommended to include these constraints.

3) CASE STUDY: LOW-GRADE GLIOMA
We evaluated the case of LGG, a type of brain cancer, to assess the knowledge identified by the clusters using the KjNMF method. We found differences in the number of enriched clusters generated by the algorithms. We also noticed that, although the proportion of variables used in KjNMF was smaller than in jNMF, with a value close to 0.75 (Fig. 5a and Supplementary Fig. F7), this did not reduce the number of enrichment terms found by KjNMF. The performance score (enriched and stable clusters) obtained with KjNMF was higher than or equal to those obtained with jNMF and iONMF in most cases (Fig. 5b and the metric plots for LGG in Supplementary Fig. F5). This led to a better distribution of the groups, with a good density of molecules per cluster, facilitating their biological interpretation. Consequently, we found that there are sets of hyperparameters that can satisfy all evaluation metrics, i.e., they yield reliable results for interpretation.

FIGURE 5. Evaluation metrics for KjNMF, iONMF, and jNMF using LGG data. (a) The ratio between the variables assigned to clusters and the total variables. (b) The performance score, which refers to the ratio of enrichment terms captured by the clusters over the number of total terms (using all the genes); the higher, the better the performance.
We used the clusterProfiler v3.14.3 package [21] to perform a gene enrichment analysis for each of the 30 clusters. As a result, we found 196 enriched terms (Supplementary File 4), of which 36 were not identified by jNMF (using the best set of hyperparameters from Table 1). Among the new terms, we highlight those associated with oxidative phosphorylation, amino acid metabolism, and glycosphingolipid biosynthesis, because they indicate that the clusters contain information on both basic and more complex processes related to the disease.
Using jNMF methods permits finding associations between clusters, i.e., the $k$-th clusters of miRNAs, proteins, and genes can be biologically associated; this is known as a co-cluster. We investigated whether there was any relationship between gene/miRNA and gene/protein co-clusters. In the case of the gene/miRNA co-clusters, we explored the top 5 co-clusters with the most molecules. We used the OmicsNet tool [22] and miRNet v2.0 [28] to interpret co-clusters between miRNAs and genes. The enriched terms involved in the co-clusters were related to cancer processes, as presented in Table 2 (Supplementary File 5). For example, gene cluster No. 7 and miRNA cluster No. 7 share the FcεRI signaling pathway and T-cell activation terms. Interestingly, these two pathways are related in the context of type I hypersensitivity reactions. Briefly, allergens stimulate T helper type 2 (Th2) cells to secrete interleukin (IL)-4, causing B cells to release immunoglobulin (Ig) E, which binds to the high-affinity FcεRI receptors of mast cells, which, as a consequence, release biologically active mediators, causing an allergic reaction [31], [32]. In cancer, studies of atopy as a risk factor have reached contradictory results [33], [34]. Therefore, our multi-omic strategy can be used to explore whether allergic diseases might be associated with cancer risk.

FIGURE 6. Enrichment analysis of OmicsNet for three layers. Highlighted nodes belong to co-cluster 21, which we selected for analysis. Dark blue-filled nodes correspond to the biomolecules related to the enriched terms listed above. The size of the nodes is for visual perspective.
To interpret a specific co-cluster case for gene/protein and gene/miRNA, we explored co-cluster No. 21, which contains 157 genes, 54 miRNAs, and 47 proteins. First, we used the OmicsNet tool [22] to obtain an interaction network between omics, choosing its minimum-network option because it generates a network restricted to the input molecules, which allows us to identify associations between them. This network showed a relationship among the molecules of the three input omics (Fig. 6).
We further inspected the gene/protein associations using the Ingenuity Pathway Analysis (IPA) software [23]. The enrichment analysis for genes/proteins comprised PI3K/AKT signaling, insulin receptor signaling, IGF-1 signaling, and other biological pathways (Supplementary File 6). This cluster grouped genes and proteins associated with tumor cell growth and survival. Our results, therefore, suggest that the KjNMF method improves the associations in terms of enrichment. We also highlight that, even using a restricted set of genes, we were able to identify essential clusters related to low-grade glioma, i.e., KjNMF does not create clusters by randomly assigning genes but rather by incorporating all the information from the omics and the prior knowledge.

IX. DISCUSSION
In different areas such as health, bioinformatics, or robotics, data integration from different sources becomes indispensable for decision making. In particular, data integration becomes increasingly important in biological data science to reveal new and meaningful insights into the complex pathological processes and disease pathways. Furthermore, since biological data represent complex systems, there is a need for developing strategies that facilitate their understanding. Therefore, we proposed this methodology of omic data integration through joint factorization based on kernels.
Here, we presented KjNMF, a method that extends the standard jNMF method with kernels to perform the factorization in a high-dimensional feature space. Different kernel NMF approaches have been developed, and their advantage over the conventional method has been demonstrated [10]. However, we went one step further by integrating prior knowledge and exploring the advantages of performing this integration in a feature space where the non-linearity of real-world data patterns can be captured through linear relationships.
Using isotropic Gaussian distributions to create synthetic data, we found that our method performs better than the standard jNMF methods. Furthermore, it has been shown that incorporating non-linearity into linear methods makes them perform much better on high-dimensional data [35]. To establish the best performance of KjNMF versus the jNMF methods, we defined a set of metrics capable of identifying the best set of hyperparameters. These hyperparameter selection metrics required the following conditions to be met: high cluster-structure stability ($\rho \approx 1$) and high cluster stability between train and test datasets (high values of AUC). KjNMF satisfied these conditions, which is attractive because, in the standard jNMF methods, the AUC remained low ($\approx 0.60$), i.e., the patterns detected depend on the sample split, which limits their interpretation. In both cases, $\rho$ remained above 0.8, which means that the structure of the clusters remains similar in each iteration of the algorithm. Therefore, it is clear that linear methods are not suitable for these synthetic data, since the data contain non-linear structures.
In the biological domain, we found that molecules were not randomly assigned to clusters; that is, both methods, jNMF and KjNMF, found meaningful clusters. We identified two advantages of KjNMF over the jNMF and iONMF methods: (i) the enrichment analysis of KjNMF showed new biological terms that jNMF could not detect, and (ii) there was also an association between co-clusters of different molecules, i.e., the co-cluster associations are created between similar processes, such as immune response processes. This result is relevant because other authors have found no direct relationship between co-clusters [7]. We can also attribute this to the inclusion of the prior-knowledge constraints, which narrowed the search space of feasible solutions, i.e., they favor the formation of these clusters. We incorporated these constraints specifically for each layer, just as in the standard jNMF method [6]. These constraints allow associations between molecules of the same category (e.g., gene-gene) and between different categories (e.g., miRNA-protein).
Along with this, we also explored the normalization methods applied during the updating of the $A_I$ and $H_I$ matrices, since these methods reduce the non-uniqueness problem of NMF methods [14]. As evaluated in [14], normalization stabilizes the obtained clusters, preventing initial conditions or noise from altering their formation. Methods 1, 2, and 3 are modifications of the original jNMF implementation; with them, the performance of the jNMF method was lower than that of the KjNMF method in the AUC metric and in the information captured (number of enriched clusters). Methods 4 and 5 use the $\ell_2$ and $\ell_1$ norms, respectively. Specifically, method 4 showed the best performance on all the evaluated scores. The use of these norms ($\ell_1$ and $\ell_2$) allows the resulting matrices to consist only of the relevant components, reducing the influence of data noise on the correct formation of the clusters. We recommend the use of normalization method 4, since the $\ell_2$ norm can shrink many values toward zero without loss of information [36]. The other normalization methods were associated with a low AUC, so their clusters are likely affected by the origin or noise of the data ($X_I^{\text{Test}}$); as a consequence, this may affect the interpretation of the clusters.
The main limitation of the proposed method is the loss of interpretation at the level of samples or patients. In jNMF, the basis matrix ($W$) has a centroid interpretation to which each sample belongs, e.g., clusters of patients with distinguishable biological characteristics are generated; the probability of survival, for instance, is statistically different between the clusters obtained with the jNMF methods [37]. This interpretation is not easily achieved in our approach, since this matrix lives in a feature space of possibly infinite dimension. Despite this, pre-image research based on different strategies could be explored in the future [38], even though a disadvantage of pre-imaging is that the solution is not unique. Nevertheless, we believe that the proposed approach represents an advance in data integration using kernel-based joint matrix factorization. Thus, our method applies to other areas, such as hyperspectral image and signal processing, and to personalized medicine with the use of other biological databases.

X. CONCLUSION
A method of integration using joint non-negative matrix factorization based on kernels (KjNMF) was presented. This method used the Gaussian kernel, demonstrating its efficiency against the standard jNMF method. The advantages over the standard jNMF method are: (i) it allows working with data with non-linear structures, (ii) it allows identifying associations between variables (interpretable clusters), and (iii) it shows better stability of the clusters across training and testing datasets. Besides, novel evaluation mechanisms were proposed in our method, such as (iv) the incorporation of prior knowledge in the kernel proposal, (v) the incorporation of normalization methods to avoid the non-uniqueness problem, and (vi) the use of metrics to determine sets of hyperparameters useful in cluster interpretation.
We used synthetic data and biological data to demonstrate these advantages. In all experiments, KjNMF outperformed the standard jNMF methods with respect to cluster stability and interpretability metrics, i.e., the proposed method reduces the non-uniqueness problem, facilitating interpretation by finding new clusters. In addition, we used different methods to normalize the low-rank matrices and found that using the $\ell_2$ norm is an excellent approach to reduce the noise included in real-world data. Therefore, our approach helps in understanding complex integrated data by increasing their interpretive value.
DIEGO SALAZAR received the B.A. degree in pharmacy from the National University of Colombia, in 2011, and the MBiolSci degree from Pontifical Xavierian University. He is currently pursuing the Ph.D. degree in engineering and works as a Lecturer in statistics with the Industrial Engineering Department, University of Los Andes. He has expertise in statistical learning, linear statistical models, and data mining for decision-making. He has also worked in clinics and the pharmaceutical industry. His research interests include causal inference, multi-modal machine learning, kernel-based models, and cancer.
JUAN RIOS received the degree in industrial engineering, with a minor in computer science, from the University of Los Andes, in 2020. Throughout his career, he was a mentor in subjects such as statistics and programming. Since 2019, he has worked for a global consulting firm as a data and customer behavior analyst for multiple worldwide companies.
SARA ACEROS received the degree in industrial and biomedical engineering and the master's degree in industrial engineering from the University of Los Andes. She was a Graduate Teaching Assistant in statistics courses at the University of Los Andes. Her research interests include machine learning, statistics, and mathematical methods to integrate and analyze biological data.
OSCAR FLÓREZ-VARGAS received the M.Sc. degree in biochemistry from the National University of Colombia, in 2008, and the Ph.D. degree in computer science from The University of Manchester, U.K., in 2016. He is currently a Postdoctoral Fellow at the National Cancer Institute, USA. His research interests include computational analysis of genotype-phenotype relationships to further evaluate the contribution of specific genomic regions, genes, and genetic variants to cancer phenotypes.
CARLOS VALENCIA received the M.Sc. degrees in statistics and engineering and the Ph.D. degree in industrial engineering, with a focus on statistics, from the H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology. He is currently an Associate Professor with the Industrial Engineering Department, University of Los Andes. His research has been published in different journals related to statistics, data analysis, and simulation. His research interests include applied statistics and the use of regularization mechanisms for functional estimation in complex and high-dimensional settings.