Dual Graph-Laplacian PCA: A Closed-Form Solution for Bi-clustering to Find "Checkerboard" Structures on Gene Expression Data

In the context of cancer, internal "checkerboard" structures are normally found in the matrices of gene expression data, which correspond to genes that are significantly up- or down-regulated in patients with specific types of tumors. In this paper, we propose a novel method, called dual graph-regularization principal component analysis (DGPCA). The main innovation of this method is that it simultaneously considers the internal geometric structures of the condition manifold and the gene manifold. Specifically, we obtain principal components (PCs) to represent the data and approximate the cluster membership indicators through Laplacian embedding. This new method is endowed with internal geometric structures, such as the condition manifold and gene manifold, which are both suitable for bi-clustering. A closed-form solution is provided for DGPCA. We apply this new method to simultaneously cluster genes and conditions (e.g., different samples) with the aim of finding internal "checkerboard" structures on gene expression data, if they exist. Then, we use this new method to identify regulatory genes under the particular conditions and compare the results with those of other state-of-the-art PCA-based methods. Extensive experiments on gene expression data yield promising results.


INTRODUCTION

Biological analysis of PCA
With the development of molecular biology, the gene chip has become one of the most important technologies of gene functional annotation in the post-genomic era [1]. Determining how to excavate reliable information from the high-throughput and multivariable gene chip to explain the regulatory network of gene function is the bottleneck problem of bioinformatics [2]. While retaining the main information of the original data, principal component analysis (PCA) transforms the data to a low-dimensional linear or nearly linear subspace constituted by principal components (PCs) [3]. This method overcomes the limitations of bioinformatics methods in gene chip analysis and provides new inspiration for biological data mining. For example, the selected components reduce the complexity of the gene chip variables and support clustering of the resulting data. This method provides the basis for early diagnosis and subtyping of cancer.

Checkerboard structures in gene expression data and relations with PCA
In the absence of class knowledge of genes and samples, it is necessary to find potential classes by exploiting the relationship between genes and conditions. Bi-clustering exploits the potential two-sided data structure, so that clustering in both dimensions yields a meaningful joint study of genes and samples. This method has achieved better results than single-dimensional clustering, which clusters conditions or features independently [4][5][6][7]. Since the gene expression data generated by gene chip technology are expressed as a "high-dimensional small-sample" matrix, it is reasonable to assume that there are "checkerboard" structures in these data [8]. Bi-clustering can be used to find the checkerboard structures hidden in a gene expression data matrix, a problem that has been well studied [8]. Specifically, bi-clustering is performed in the row and column directions simultaneously, so that the two clusterings interact with and constrain each other, to identify the checkerboard structures within the gene expression data, if they exist. These checkerboard structures are formed by the network of genes and conditions (e.g., different samples). In the context of cancer, these structures are associated with significantly up-regulated or down-regulated genes in patients with specific types of tumors [8].
The raw gene expression data matrix is presented graphically in Fig. 1. In this matrix, the rows represent genes and the columns are different experimental conditions (e.g., different samples). Under this assumption, the matrix in Fig. 1 could be reorganized in a framework with a checkerboard-like structure. The various blocks in this structure are the strongly correlated genes (rows) over a subset of samples (columns). Medical researchers can develop personalized studies of different patients (samples) with a variety of regulatory genes.
PCA has been widely applied in research and has yielded satisfactory results in clustering [9]. Although the features selected by PCA retain the main information of the original variables, and this information is the main part of the variation, there are weaknesses to this approach. For example, using limited data to obtain more useful information is one of the biggest bottlenecks. When PCA is applied for dimension reduction, it is useful to introduce manifold learning to learn the internal geometric structure. It has been shown that the incorporation of manifold learning into PCA improves clustering [10]. Manifold learning finds the low-dimensional structure in high-dimensional data, which reveals the nonlinear geometric structure within the data [11]. The cross-application of manifold learning and other techniques has yielded satisfactory results [10,12]. Nonlinear manifold learning algorithms include Laplacian Eigenmaps (LE), Isomap, and Locally Linear Embedding (LLE) [13][14][15]. For example, Jiang et al. proposed graph Laplacian PCA (gLPCA) and its robust model (RgLPCA), which introduced manifold learning into PCA [10]. Additional improved models based on gLPCA have been proposed, and good clustering and feature selection results have been obtained [16].
The methods mentioned above only focus on one-way clustering, which is solely based on genes or samples and ignores the relationship between them. Especially for high-dimensional, sparse and noisy data, it is difficult for such methods to meet the accuracy requirements in practice. Bi-clustering has therefore attracted increasing attention and has achieved better results than one-way clustering. Bi-clustering clusters the rows and columns at the same time, so that the two clusterings assist and constrain each other, and it is also effective on high-dimensional and sparse data. In this paper, we incorporate manifold learning into the PCA model both in the principal directions (gene manifold) and along the PCs (condition manifold) to consider the internal structure of the data. In this way, the checkerboard structures inside the observation data are constructed.

Uncovering checkerboard structures through dual graph-regularization PCA
Motivated by recent progress in the PCA method and bi-clustering [17,18], we propose a novel method called dual graph-regularization principal component analysis (DGPCA). This method simultaneously considers the internal geometric structures of the condition manifold and the gene manifold. The geometric structures of the sample and gene spaces are encoded by constructing two nearest-neighbor graphs. To summarize, the main contributions of this paper are as follows:
1. We propose a novel PCA method named DGPCA. This method simultaneously considers the internal geometric structure information contained in both condition and gene data.
2. We present a closed-form solution for this problem and design an algorithm to address it, which avoids the instability of the iterative algorithm.
3. A visual checkerboard structure is found by the proposed method in combination with bi-clustering in the observed data. This structure corresponds to the genes that are significantly up-regulated or down-regulated in patients with certain types of tumors.
4. To detect new "marker genes" in the checkerboard structures, we mine the genes that are strongly regulated under the particular "conditions". The regulatory genes found in this way are more relevant than those selected by graph-PCA-based methods.
DGPCA provides a tool that is helpful for the study of the pathogenesis of cancer.
The rest of this paper is organized as follows. First, several related works are introduced in Section 2, including graph Laplacian PCA (gLPCA) and its robust models. Then, the DGPCA method is formulated in Section 3. The closed-form solution of this problem is also given in this section. Comprehensive experiments are carried out to evaluate the DGPCA method in Section 4. Finally, the paper is concluded in Section 5.

RELATED WORK
Before we present the details of our method, some terms and notations are listed in Table 1, which will be frequently used in the following section. Then, we review some works that are related to this paper.
PCA finds the $k$-dimensional linear subspace where the projected data are as close as possible to the original data [10]. To provide an embedding for data lying on a non-linear manifold, graph Laplacian PCA (gLPCA) was proposed [10]. The main task is to study a representation of the data matrix $X$ that incorporates the cluster information encoded in the graph data $W$. This aim can be achieved by solving the following problem:

$$\min_{U,V} \; \|X - UV^T\|_F^2 + \gamma\,\mathrm{Tr}(V^T L V) \quad \text{s.t.} \quad V^T V = I,$$

where $\gamma$ is the parameter that balances the contributions of the two terms; $L = D - W$ is the graph Laplacian matrix, where $D$ with $D_{ii} = \sum_j W_{ij}$ is a diagonal matrix whose elements are column or row sums of $W$; and $W$ is the weight matrix containing the edge weights of the graph with $n$ nodes. The definition of $W_{ij}$ can be expressed as follows:

$$W_{ij} = \begin{cases} 1, & \text{if } x_i \in N_k(x_j) \text{ or } x_j \in N_k(x_i), \\ 0, & \text{otherwise,} \end{cases}$$

where $N_k(x_i)$ is the set of the $k$-nearest neighbors of the data point $x_i$, each of which is connected to $x_i$ by an edge in the graph.
The error function of gLPCA is a sum of squared residuals, so data points that deviate significantly from the model dominate the objective. To reduce the influence of outliers and noise, various robust versions of gLPCA have been proposed. These robust versions are formulated as follows:

$$\min_{U,V} \; \|X - UV^T\|_z + \gamma\,\mathrm{Tr}(V^T L V) \quad \text{s.t.} \quad V^T V = I,$$

where $z$ can be the L2,1-, L1/2- or P-norm. The L2,1-norm is defined as $\|A\|_{2,1} = \sum_j \sqrt{\sum_i A_{ij}^2}$; the P-norm is more flexible and can be tuned from 0 to 1 by a proximal operator applied to a vector $t$ with tuning parameter $\alpha$ [20].
Good results can be obtained by these methods in feature selection and clustering. The robust versions utilize the L2,1-, L1/2- and P-norm in their error functions.

Construct sample and gene graph
Recent research has shown that both the observed samples and genes lie on nonlinear low-dimensional manifolds, namely, the sample manifold and gene manifold, respectively [18]. Thus, we introduce two graphs to model the internal geometric structures of both the sample manifold and gene manifold. More specifically, we construct two graphs with different dimensions, namely the sample graph and gene graph, to explore the internal geometric structures of the rows and columns in gene expression data. The $k$-nearest-neighbor sample graph, whose vertices correspond to the columns $\{X_{:,1}, \ldots, X_{:,n}\}$ of $X$, is first constructed. Following previous research, we use the 0-1 weighting scheme to construct the $k$-nearest-neighbor data graph [10]. The sample weight matrix $W^s$ can be defined as follows:

$$W^s_{ij} = \begin{cases} 1, & \text{if } X_{:,j} \text{ is a } k\text{-nearest neighbor of } X_{:,i} \text{ or vice versa}, \\ 0, & \text{otherwise.} \end{cases}$$

The graph Laplacian of the sample graph is $L_s = D^s - W^s$, where $D^s$ with $D^s_{ii} = \sum_j W^s_{ij}$ is a diagonal degree matrix. Similarly, we also use the 0-1 weighting scheme to construct the $k$-nearest-neighbor gene graph, whose vertices correspond to the rows of $X$, and we denote its graph Laplacian by $L_g = D^g - W^g$.
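The graph construction described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `knn_laplacian`, the Euclidean distance, the symmetrization rule (an edge exists if either endpoint lists the other as a neighbor) and the toy matrix are all assumptions made for the example.

```python
import numpy as np

def knn_laplacian(points, k):
    """Return (W, L) for a symmetric 0-1 k-nearest-neighbor graph.

    points: array of shape (m, d), one graph vertex per row.
    """
    m = points.shape[0]
    # Pairwise squared Euclidean distances between all vertices.
    sq = np.sum(points ** 2, axis=1)
    dist = sq[:, None] + sq[None, :] - 2.0 * points @ points.T
    np.fill_diagonal(dist, np.inf)  # a vertex is not its own neighbor
    # Indices of the k nearest neighbors of each vertex.
    nbrs = np.argsort(dist, axis=1)[:, :k]
    W = np.zeros((m, m))
    W[np.repeat(np.arange(m), k), nbrs.ravel()] = 1.0
    W = np.maximum(W, W.T)          # edge if either endpoint lists the other
    D = np.diag(W.sum(axis=1))      # diagonal degree matrix
    return W, D - W

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 6))    # toy data: 20 genes x 6 samples
W_s, L_s = knn_laplacian(X.T, k=2)  # sample graph: vertices are columns of X
W_g, L_g = knn_laplacian(X, k=3)    # gene graph: vertices are rows of X
```

The same routine serves both graphs; only the orientation of the data matrix changes, which is why the two Laplacians have different dimensions (samples by samples and genes by genes).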

Objective function of dual graph-regularization PCA
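Based on the graph regularizations of the sample manifold and gene manifold, we propose a new dual graph-regularization PCA method, with an objective function that is formulated as follows:

$$\min_{U,V} \; \|X - UV^T\|_F^2 + \alpha\,\mathrm{Tr}(U^T L_g U) + \beta\,\mathrm{Tr}(V^T L_s V) \quad \text{s.t.} \quad V^T V = I,$$

where $L_g$ and $L_s$ are the graph Laplacians of the gene graph and the sample graph, and $\alpha$ and $\beta$ are parameters that balance the contributions from the reconstruction error of DGPCA in the first term and the graph regularizations in the latter two terms. When $\alpha = 0$, DGPCA degrades to the gLPCA method, and when $\alpha = \beta = 0$, DGPCA degrades to the standard PCA method.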

Closed-form solution of DGPCA
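We present a closed-form solution to the problem, which avoids the instability of an iterative solution. Let $L_g$ and $L_s$ denote the graph Laplacians of the gene graph and the sample graph. Using $V^T V = I$, the objective function can be rewritten as follows:

$$J(U,V) = \mathrm{Tr}(X^T X) - 2\,\mathrm{Tr}(V^T X^T U) + \mathrm{Tr}(U^T U) + \alpha\,\mathrm{Tr}(U^T L_g U) + \beta\,\mathrm{Tr}(V^T L_s V).$$

First, by computing the optimal $U$ while fixing $V$, we can obtain the following results:

$$\frac{\partial J}{\partial U} = -2XV + 2U + 2\alpha L_g U = 0.$$

Thus, the optimal solution of $U$ is given by

$$U = (I + \alpha L_g)^{-1} X V.$$

Here, we substitute this $U$ into the objective function and solve for the optimal $V$. By some algebra, we have

$$\mathrm{Tr}(U^T U) + \alpha\,\mathrm{Tr}(U^T L_g U) = \mathrm{Tr}\big(U^T (I + \alpha L_g) U\big) = \mathrm{Tr}\big(V^T X^T (I + \alpha L_g)^{-1} X V\big).$$

As a result, the problem in $V$ is equivalent to the following:

$$\min_{V} \; \mathrm{Tr}(V^T B V) \quad \text{s.t.} \quad V^T V = I,$$

where

$$B = \beta L_s - X^T (I + \alpha L_g)^{-1} X.$$

Therefore, the optimization problem can be solved by the eigenvectors corresponding to the $k$ smallest eigenvalues of the matrix $B$.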
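The closed-form procedure can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the authors' code: it assumes the objective $\|X - UV^T\|_F^2 + \alpha\,\mathrm{Tr}(U^T L_g U) + \beta\,\mathrm{Tr}(V^T L_s V)$ with $V^T V = I$, and the names `dgpca_closed_form`, `objective` and `path_laplacian` (a toy stand-in for the k-nearest-neighbor Laplacians) are hypothetical.

```python
import numpy as np

def dgpca_closed_form(X, L_g, L_s, alpha, beta, k):
    """X: (p, n) genes x samples; L_g: (p, p) gene Laplacian; L_s: (n, n) sample Laplacian."""
    p = X.shape[0]
    A = np.eye(p) + alpha * L_g           # symmetric positive definite
    AinvX = np.linalg.solve(A, X)         # (I + alpha * L_g)^{-1} X
    B = beta * L_s - X.T @ AinvX
    B = (B + B.T) / 2.0                   # remove round-off asymmetry
    eigvals, eigvecs = np.linalg.eigh(B)  # eigenvalues in ascending order
    V = eigvecs[:, :k]                    # eigenvectors of the k smallest eigenvalues
    U = AinvX @ V                         # optimal U for this V
    return U, V

def objective(X, U, V, L_g, L_s, alpha, beta):
    """Value of the assumed DGPCA objective at (U, V)."""
    fit = np.linalg.norm(X - U @ V.T, "fro") ** 2
    return fit + alpha * np.trace(U.T @ L_g @ U) + beta * np.trace(V.T @ L_s @ V)

def path_laplacian(m):
    """Laplacian of a path graph on m vertices (toy stand-in for a k-NN graph)."""
    W = np.diag(np.ones(m - 1), 1)
    W = W + W.T
    return np.diag(W.sum(axis=1)) - W

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 8))          # toy data: 30 genes x 8 samples
L_g, L_s = path_laplacian(30), path_laplacian(8)
U, V = dgpca_closed_form(X, L_g, L_s, alpha=0.05, beta=0.5, k=3)
```

Solving the linear system with `np.linalg.solve` avoids forming the inverse of $I + \alpha L_g$ explicitly, which matters when the number of genes is in the thousands. Setting `alpha=0` and `beta=0` reduces the routine to an uncentered PCA of `X`, in line with the degenerate cases of the objective.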

EXPERIMENTS
The primary goal of the experiments is to evaluate the proposed DGPCA method in comparison with gLPCA because DGPCA incorporates the gene manifold based on gLPCA. For completeness, we also compare our results to existing research results and the results of some other graph-Laplacian-PCA-based methods, such as RgLPCA [10], L1/2gLPCA [19] and PgLPCA [20].
These methods focus on one-way clustering and only learn the sample geometry with the PCA method. First, we present a visual heat map to display the results of bi-clustering to find "checkerboard" structures, if they exist.
Then, experiments on selecting regulatory genes are presented to evaluate the performance of DGPCA compared with other methods and the existing research results. Biological analysis of these genes provides the basis for further research on new cancer markers. The experimental datasets and the parameter settings for each method are described in the following subsections.

Datasets
The datasets used in these experiments are described as follows:
Leukemia data: The leukemia data consist of a matrix that includes 38 samples and 5000 genes. This dataset is publicly available at https://sites.google.com/site/feipingnie/file. It contains 11 samples of acute myelogenous leukemia (AML) and 27 samples of acute lymphoblastic leukemia (ALL), and ALL is divided into T- and B-cell subtypes [21].
Colon cancer data: The colon cancer data consist of a matrix that includes 2000 genes and 62 tissues. These tissues are divided into 22 normal and 40 colon tumor samples [22]. This dataset and its detailed description are publicly available at http://genomicspubs.princeton.edu/oncology/affydata/index.html

Experimental setting
For each method, all the parameters are tuned to search for the optimal value. Since the parameter γ in gLPCA, RgLPCA, and L1/2gLPCA, and the parameter P in PgLPCA can be tuned in the range of 0~1, we search for their optimal values in [0 : 0.1 : 1]. Within the given range, the greater the value of γ, the greater the role of the graph Laplacian in the objective function. According to previous research, we set γ = 0.5 to obtain fair results [10,19,20]. In practice, we set the parameter ρ = 1.2 in RgLPCA and L1/2gLPCA. A wider range has also been investigated, but ρ = 1.1~1.5 yields good results. Since the parameters α and β of the proposed method have no special limits, we search for their optimal values over a range of candidate values. We set β = γ = 0.5 to obtain the condition manifold because both the parameters β and γ control the contribution of the graph Laplacian in the samples. In practice, we find that when α = 0.05, satisfactory results can be obtained. Based on the number of categories of the experimental data, we set the numbers of reduced dimensions $k_1 = 3$ and $k_2 = 2$ for the leukemia data and colon cancer data, respectively.

Bi-clustering results to find "checkerboard" structure
Our proposed method provides a geometric structure of not only the condition manifold but also the gene manifold. To assess our method, it is useful to observe how well it performs on several gene expression datasets, with respect to achieving the goal of finding checkerboard structures. Since the previous PCA-based methods ignore the joint geometric structures of conditions and genes, they are not designed to reveal the checkerboard structures of gene expression data. Therefore, only the proposed method DGPCA is evaluated through bi-clustering and the visual heat map. We use bi-clustering as a tool for data visualization and reasonable interpretation of our method. Our method employs manifold learning schemes that highlight the internal geometric structures of both genes and conditions, thereby directly revealing the degree of bi-clustering.
We apply the proposed method to two publicly available datasets: leukemia data and colon cancer data.
The visual heat map is used to display the results of bi-clustering to find checkerboard structures, if they exist. The heat maps (a) and (b) in Fig. 2 display the bi-clustering results of DGPCA on the leukemia and colon cancer data, respectively. In each panel, on the left is the checkerboard structure of the data, where each column corresponds to a sample; in the center are the principal directions (gene manifold); and on the right are the projected samples in the new subspace (condition manifold). The two coordinates represent the sample number and gene expression level. From Fig. 2, we can observe that the two raw datasets are rearranged in a checkerboard structure. In heat map (a), the arrangement of the 38 samples is generally based on the three types of labels: AML, T- and B-cells. A similar conclusion can be drawn from the colon data in heat map (b). The 62 tissues are generally arranged according to two types of labels because these tissues are divided into normal and colon tumor samples. The intersections of the different colors show the interaction between the samples and the genes. Specifically, blocks of different colors represent the clustering results of different data classes under the interaction of the gene and condition manifolds. These consistently formatted graphs show the checkerboard structure of each data class on the condition manifold together with the gene manifold.
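The rearrangement behind such a heat map can be illustrated with a short sketch. It assumes that cluster labels for genes and samples are already available, for instance from clustering the rows of the gene embedding and of the sample embedding; the text does not specify the clustering step, so the labels, the function name `checkerboard_order` and the toy matrix below are hypothetical.

```python
import numpy as np

def checkerboard_order(X, gene_labels, sample_labels):
    """Reorder rows and columns of X so that equal labels become contiguous."""
    row_order = np.argsort(gene_labels, kind="stable")
    col_order = np.argsort(sample_labels, kind="stable")
    return X[np.ix_(row_order, col_order)], row_order, col_order

# Toy example: 2 gene clusters and 2 sample clusters, interleaved in the raw matrix.
gene_labels = np.array([0, 1, 0, 1, 0, 1])
sample_labels = np.array([1, 0, 1, 0])
# High expression where gene and sample clusters agree, low expression elsewhere.
X = np.where(gene_labels[:, None] == sample_labels[None, :], 5.0, -5.0)
X_sorted, row_order, col_order = checkerboard_order(X, gene_labels, sample_labels)
```

After sorting, the constant blocks of `X_sorted` form a 2 x 2 checkerboard, which is the pattern a heat map of the reordered matrix would display.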

Finding regulatory genes under the particular "conditions"
Since bi-clustering is used as a tool for data visualization and interpretation, it is natural to assess the quality of the bi-clusters in terms of biological significance or accuracy.
Here, we perform a study to assess the quality of our method, in which we apply DGPCA to the experimental data used in this paper. The top-100 regulatory genes are selected from both the leukemia and colon cancer data for analysis. First, we rank the scores of all genes in descending order. Then, the regulatory genes can be extracted by the corresponding indices. In other words, the extracted top genes with high scores can be deemed regulatory genes. DAVID 6.7 is used as a tool to find the official names of the selected regulatory genes from the leukemia data; it is publicly available at https://davidd.ncifcrf.gov/. For the colon cancer data, we search for the abbreviation of each gene using ToppGene Suite, which is publicly available at https://toppgene.cchmc.org/enrichment.jsp. We download the pathogenic gene pools for leukemia and colon cancer from GeneCards. This searchable, integrative database is publicly available at http://www.genecards.org/. We then match the selected genes against the pathogenic gene pools for the two experimental datasets.
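The ranking step can be sketched as below. The text does not state how the gene scores are computed, so this example assumes a common choice: the score of a gene is the sum of the absolute values of its row in the matrix of principal directions. The function name `top_genes`, the matrix `U_toy` and the gene names are hypothetical.

```python
import numpy as np

def top_genes(U, gene_names, top=100):
    """Rank genes by an assumed score: the row-wise sum of absolute loadings in U."""
    scores = np.abs(U).sum(axis=1)
    # Descending order; a stable sort keeps the original order among ties.
    order = np.argsort(-scores, kind="stable")[:top]
    return [gene_names[i] for i in order], scores[order]

# Toy loadings for 4 genes and k = 2 components.
U_toy = np.array([[0.1, -0.2],
                  [1.0, 0.5],
                  [-0.7, 0.0],
                  [0.0, 0.05]])
names, top_scores = top_genes(U_toy, ["g0", "g1", "g2", "g3"], top=2)
```

With `top=100` the same call would return the top-100 list used in the analysis; the indices in `order` are what map the ranking back to gene identifiers.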

Analysis of matching results
The matching results on the leukemia and colon cancer data of gLPCA, RgLPCA, L1/2gLPCA, PgLPCA and DGPCA are listed in a separate file. This file lists the regulatory genes selected by each compared method, where the marked genes denote pathogenic genes selected only by our method and not by other methods. The relevance scores of each regulatory gene associated with the disease are also listed in this file. To determine the efficiency of the identified regulatory genes, we summarize the total relevance scores (TRS), accuracies (ACC) and average relevance scores (ARS) in Table 2. The best results are highlighted in bold. ACC is the accuracy of the regulatory genes from the selected top-100 genes; it is defined as follows:

$$\mathrm{ACC} = \frac{\sum_{i=1}^{n} \delta\big(p_i, \mathrm{map}(q_i)\big)}{n},$$

where $q_i$ is a regulatory gene that was selected by our method, $p_i$ is a pathogenic gene of the disease, and $n$ is the number of selected genes. The function $\delta$ is defined as

$$\delta(x, y) = \begin{cases} 1, & x = y, \\ 0, & \text{otherwise,} \end{cases}$$

where $\mathrm{map}(q_i)$ is the mapping function. A higher ACC value indicates improved performance. From Table 2, we make the following observations:
1. The lowest TRS are obtained by PCA, on both the leukemia and colon cancer data. This is reasonable, since classical PCA is not robust enough.
2. By considering the internal geometric structures, gLPCA achieves some improvement over classical PCA.
3. RgLPCA, L1/2gLPCA and PgLPCA also aim at improving the robustness of the algorithm. PgLPCA outperforms the others because this method provides the utmost flexibility, since the value of P can be tuned from 0 to 1.
4. The ACC result of DGPCA on the colon cancer data is the highest and that on the leukemia data is the lowest. The large amount of noise in the leukemia data might be the main cause, which leads to the ACC result of DGPCA being less than those of robust methods such as RgLPCA, L1/2gLPCA and PgLPCA. These methods are designed to improve the robustness and reduce the effects of noise and outliers. However, the TRS and ARS values obtained by DGPCA are the highest on both datasets because DGPCA picked out some important genes that were ignored by the other methods.
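The three summary measures can be computed from a list of selected genes and a pool of pathogenic genes with relevance scores, as in the sketch below. The precise definitions are assumptions: ACC is taken as the fraction of selected genes found in the pool, TRS as the sum of the relevance scores of the matched genes, and ARS as TRS divided by the number of matched genes. The function name `match_metrics`, the gene names and the scores are toy values, not data from the experiments.

```python
def match_metrics(selected_genes, relevance_pool):
    """selected_genes: list of gene symbols.
    relevance_pool: dict mapping pathogenic gene symbol -> relevance score."""
    matched = [g for g in selected_genes if g in relevance_pool]
    trs = sum(relevance_pool[g] for g in matched)   # total relevance score
    acc = len(matched) / len(selected_genes)        # fraction of matched genes
    ars = trs / len(matched) if matched else 0.0    # assumed: mean over matched genes
    return {"TRS": trs, "ACC": acc, "ARS": ars}

# Toy pool and selection (hypothetical names and scores).
pool = {"GENE_A": 10.0, "GENE_B": 4.0, "GENE_C": 1.0}
selected = ["GENE_A", "GENE_X", "GENE_C", "GENE_Y"]
result = match_metrics(selected, pool)
```

Under these assumptions a method can have a lower ACC yet a higher TRS and ARS than another, which happens when it matches fewer genes but those genes carry larger relevance scores.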

Visualization of overlapping results
Here, we utilize a Venn diagram to visualize the overlap among the regulatory genes selected by the compared methods. It was obtained using OmicsBean, a multi-omics data analysis tool that is available at http://www.omicsbean.com:88/. The Venn diagram in Fig. 3 shows the following overlapping results: (a) is the overlapping result on the leukemia data and (b) is the overlapping result on the colon cancer data. The different permutations and combinations of all methods are displayed in the left coordinate of the Venn diagram.
When there is only one method on the left coordinate, the corresponding number on the right is the number of regulatory genes obtained only by this method.
When there are several methods on the left coordinate, the corresponding number on the right is the number of regulatory genes obtained by all of these methods.
As shown in Fig. 3, the proposed method selects the largest number of regulatory genes that are not selected by other methods. The numbers of such unique regulatory genes from the two datasets that were excavated by the proposed method are 18 and 11. According to the relevance scores in the additional file, these genes are highly related to disease and cannot be neglected in the study of leukemia and colon cancer. In contrast, few unique regulatory genes are excavated by PgLPCA and gLPCA. Since the relevance scores of these genes with respect to disease are not high, no further research on them will be conducted in this paper.
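The counts of unique regulatory genes read off a Venn diagram can also be obtained directly with set operations. The sketch below is illustrative only; the function name `unique_genes` and the toy selections are hypothetical and do not reproduce the genes found in the experiments.

```python
def unique_genes(selections):
    """selections: dict mapping method name -> set of selected genes.
    Returns a dict mapping method name -> genes selected by that method only."""
    unique = {}
    for method, genes in selections.items():
        # Union of the genes selected by every other method.
        others = set().union(*(g for m, g in selections.items() if m != method))
        unique[method] = set(genes) - others
    return unique

# Toy selections for three methods (hypothetical gene names).
selections = {
    "DGPCA": {"a", "b", "c"},
    "gLPCA": {"b", "d"},
    "PgLPCA": {"b", "c"},
}
unique = unique_genes(selections)
counts = {method: len(genes) for method, genes in unique.items()}
```

The entry for each method in `counts` corresponds to the region of the Venn diagram that belongs to that method alone.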

Comparison with published results
To evaluate the results of DGPCA, we compared all related genes, such as biomarker or characteristic genes, that were obtained by other methods in the literature, to the regulatory genes found in this paper. We found that 20 of the 30 genes identified by the P-norm Robust Feature Extraction (PRFE) method are related to leukemia and the ARS is 6.56 [23]. Liu et al. identified the feature genes from leukemia data by combining RPCA and LDA [24]. Eleven genes were identified as characteristic genes associated with leukemia, and the highest relevance score was 19.06. The average and highest scores of the characteristic genes selected by RGNMF from the leukemia data were 7.31 and 25.99, respectively [25]. Of the 52 genes selected by our method, 29 can be found among the biomarkers selected by Wu et al. [26]. However, the 23 genes they ignored include several important pathogenic genes, which suggests that our method recovers relevant genes that the previous study missed. In particular, some important genes with high relevance scores were not excavated in that study. Some researchers have made outstanding contributions to the discovery of colon-cancer-associated genes, but most of the regulatory genes found in this paper have been neglected [27,28]. These genes can be regarded as candidate new oncogenes and have research value for leukemia and colon cancer.
Further studies of these genes are conducted in the next subsection.

Function analysis of unique regulatory genes
Some genes acquired by DGPCA that have been ignored by the existing methods are important contributions to the study of related cancers. Thus, further analysis of these genes is necessary [26]. We list detailed information on the unique regulatory genes with relevance scores greater than 10 in Table 3 and Table 4. These genes will facilitate the study of leukemia and colon cancer in clinical practice. The functions and relevance scores of these genes have been summarized. Some of the results from these tables and from other research are summarized as follows:
1. Far more unique regulatory genes are identified by DGPCA from the leukemia data than from the colon cancer data. This large disparity is due to the distinct attributes of different datasets. For example, the labels of the leukemia data are divided into three types, whereas those of the colon cancer data are divided into two types.
2. In Table 3, FLT3 has the highest relevance score with respect to leukemia in its pathogenic gene pool, and other methods do not identify this gene. Various published articles have studied the relationship between FLT3 and leukemia, which indicates that it is an important gene in leukemia research [29,30]. FLT3 is a gene that cannot be ignored in the study of leukemia, and it highlights the accuracy of our method. The relevance score of MYC with respect to leukemia is 42.66. GeneCards indicates that acute lymphocytic leukemia, acute lymphoblastic 3 and Burkitt lymphoma are the diseases associated with this gene. Neither of these genes can be ignored in the study of leukemia, and they are not found by other methods. Other genes in Table 3 also have certain degrees of correlation with leukemia [31,32]. These important genes are missing from the 210 leukemia-related biomarkers identified by Wu et al. [26]. Our results are therefore more useful for the clinical study of leukemia than those of the other methods. The high relevance scores reflect the close relationships of CCNA1 and MS4A1 with leukemia. These results leave considerable room for further study of these genes, because there is little biological research on this topic.
3. In Table 4, the relevance scores of GSTM1 and SRC with respect to colon cancer are 49.89 and 47.03, respectively. Many investigators have conducted studies on the regulation of these two genes in colon cancer [33]. Diseases associated with these genes include colon cancer and lung cancer. Published articles have reported a relationship between them [34,35].
The above results demonstrate that our method achieves a more accurate performance than earlier methods.

Gene interaction of biological pathway analysis
In addition to analyzing the selected regulatory genes for assessing the quality of bi-clusters, it is natural to relate those genes together with conditions to biological pathways. We send these selected regulatory genes to KEGG (http://www.kegg.jp/) to find the biological pathways. KEGG is a biological database resource for understanding high-level functions and utilities of biological systems. This database provides new perspectives on genomes, biological pathways, diseases and drugs [36].
The pathway graphs of the two datasets can be found in a separate file, where the genes in pink are human disease genes, the genes in blue are drug target genes and the genes in green are human genes.
The biological pathways of the hematopoietic cell lineage record the change process of hematopoietic stem cells (HSC). These cells can undergo self-renewal or differentiation into a common lymphoid progenitor (CLP) or a common myeloid progenitor (CMP). A CLP gives rise to the lymphoid lineage of white blood cells (leukocytes), namely the natural killer (NK) cells and the T and B lymphocytes. A CMP gives rise to the myeloid lineage, which includes the megakaryocytes that produce the platelets needed for blood clotting. Therefore, cellular stages are determined by the specific expression states of these genes. The interaction of genes selected by our method describes the work of the hematopoietic system, whose disruption is also a major cause of leukemia. The ribosome pathway processes and translates genetic information. Kimura et al. conducted studies on the ribosome and colon cancer cell lines to demonstrate the role of the ribosome in some biological processes [37]. The above two biological pathways reflect the interaction of genes in the corresponding datasets.

CONCLUSION
Following the idea of bi-clustering and the relevance of rows and columns based on gene expression data, this paper presented a novel method called DGPCA. This method incorporated the information obtained by the gene manifold to improve the clustering of conditions in the model. In particular, the gene and condition manifolds could be simultaneously obtained by gene clusters and tumor clusters. The visual heat map displayed the results of bi-clustering to find "checkerboard" structures. In situations where the checkerboard structure is found, the regulatory genes are selected and compared with those obtained by other PCA-based methods and those in published articles to evaluate the quality of the bi-clusters. The identified regulatory genes have been analyzed in terms of function and co-expression (pathways). DGPCA provides a tool that is helpful for the study of the pathogenesis of cancer.

Fig. 1. Heat map of identifying checkerboard structures associated with feature genes and certain conditions. Left: the raw gene expression data, where each column corresponds to a sample. Right: the shuffled matrix containing checkerboard structures of conditions with feature genes. Notes: All the original figures in this paper can be obtained in the separate file.

Fig. 2. Heat map of DGPCA bi-clustering result on leukemia (a) and colon cancer data (b). Left: the checkerboard structure of the data, where each column corresponds to a sample. Center: the principal directions (gene manifold). Right: the projected samples in the new subspace (condition manifold). Notes: All the original figures in this paper can be obtained in the separate file.

Fig. 3. Overlap among the differentially expressed genes identified by the compared methods. Notes: All the original figures in this paper can be obtained in the separate file.

TABLE 1. SOME NOTATIONS USED IN THIS PAPER
Notation | Description
$\|A\|_F$ | the Frobenius norm of the matrix A

TABLE 3. THE DETAILED INFORMATION OF THE PECULIAR GENES ON LEUKEMIA DATA SELECTED BY OUR METHOD, INCLUDING GENE FUNCTION, ASSOCIATED DISEASE AND RELEVANCE SCORE FOR LEUKEMIA. THIS TABLE LISTS ONLY GENES WITH RELEVANCE SCORES GREATER THAN 10.
Gene | Gene function | Relevance score
CCNA1 | The protein encoded by this gene belongs to the highly conserved cyclin family, in which members are characterized by a dramatic periodicity in protein abundance throughout the cell cycle. | 14.29
MS4A1 | This gene encodes a member of the membrane-spanning 4A gene family. | 10.65
CALR | This gene acts as an important modulator of the regulation of gene transcription by nuclear hormone receptors. | 10.1

TABLE 2. RESULTS ON TOTAL RELEVANCE SCORES (TRS) OF GLPCA, RGLPCA, L1/2GLPCA AND DGPCA. THE BEST RESULTS ARE HIGHLIGHTED IN BOLD.

TABLE 4. THE DETAILED INFORMATION OF THE PECULIAR GENES ON COLON CANCER DATA SELECTED BY OUR METHOD, INCLUDING GENE FUNCTION AND RELEVANCE SCORE WITH COLON CANCER. THIS TABLE LISTS ONLY GENES WITH RELEVANCE SCORES GREATER THAN 10.
Gene | Gene function | Relevance score
SRC | The protein encoded by this gene is a tyrosine-protein kinase whose activity can be inhibited by phosphorylation by c-SRC kinase. | 47.03