Identification of DNA N4-methylcytosine Sites via Multiview Kernel Sparse Representation Model

Identifying DNA N4-methylcytosine (4mC) sites is of great significance in biological research, as 4mC is involved in chromatin structure, DNA stability, DNA–protein interaction, and the control of gene expression. However, the traditional sequencing technology for identifying 4mC sites is very time-consuming. To detect 4mC sites more effectively, we develop a multiview learning method that merges multiple feature spaces. Furthermore, we examine whether multiview learning can improve cross-species classification by fusing data from multiple species. In this study, we propose a multiview Laplacian kernel sparse representation-based classifier, called MvLapKSRC-HSIC. First, we use three feature extraction methods (position-specific trinucleotide propensity, nucleotide chemical property, and DNA physicochemical properties) to extract DNA sequence features. MvLapKSRC-HSIC uses a kernel sparse representation-based classifier with graph regularization. To maintain the independence between views, we add a multiview regularization term constructed from the Hilbert–Schmidt independence criterion (HSIC). In the experiments, MvLapKSRC-HSIC is applied to six datasets and compared with other popular methods in single-species and cross-species settings. All experimental results show that MvLapKSRC-HSIC is superior to other outstanding methods on both single species and cross species. Importantly, MvLapKSRC-HSIC can identify a series of potential DNA 4mC sites that have not yet been experimentally evaluated in multiple species and merit further research.


I. INTRODUCTION
NUCLEIC acid plays an extremely crucial role in the cell life cycle, and various modifications take place on the bases that constitute nucleic acid, which are necessary for cell physiological activities. Modifications on RNA and DNA expand the coding and heritable information of bases by introducing additional chemical groups. At present, gene modification [1], [2] has been applied in various fields and is strongly related to the pathological processes and functional changes of cells [3]. Importantly, N4-methylcytosine (4mC), a methylation of the amino group of cytosine at the C-4 position of DNA, is catalyzed by methyltransferases and plays an important role in chromatin structure, DNA stability, DNA–protein interaction, and the control of gene expression [4]–[7].
How to identify 4mC modification sites has become a hot topic in recent years. The traditional method for identifying 4mC sites is single-molecule real-time (SMRT) sequencing technology [8]. However, this method suffers from problems such as high monetary cost and long runtime. With the rapid rise of artificial intelligence technology in recent years [9]–[15], a large number of research works based on computational methods have been completed to obtain DNA 4mC sites [16]–[20]. Chen et al. [21] put forward iDNA4mC, which constructed features from DNA sequences via nucleotide chemical properties [22] and nucleotide frequency properties and applied an SVM model to predict 4mC sites. He et al. [44] proposed 4mCPred, which constructed features by position-specific trinucleotide propensity (PSTNP) and the electron-ion interaction pseudopotential [23], thereby introducing the location information and physicochemical information of DNA sequences; they also used an SVM model to predict 4mC sites. Wei et al. [24] used K-mer nucleotide frequency, mononucleotide binary encoding, and other methods to construct features and used an SVM to predict 4mC sites with good results. Manavalan et al. [25] proposed Meta-4mCpred, which used a metapredictor for 4mC site prediction; Khanal et al. [26] and Liu et al. [27] both adopted deep learning-based approaches to improve the prediction of DNA 4mC sites via convolutional neural networks. Although the abovementioned methods have achieved good results, most of them simply splice multiple features into one feature vector used as the input of the classification model. This makes the spliced features lose their original meaning. Moreover, with more features, the dimension of the combined feature vector grows greatly, which brings unnecessary difficulty to learning.
To solve the problems of high cost and long computation time, and to give full play to the advantages of multiple feature spaces, we introduce the multiview learning method, which has achieved landmark development in recent years [28]–[32]. Multiview learning means using multiple representations of the same object as "views" for modeling and solving problems. It overcomes the limitation of traditional machine learning methods, which can only process data from a single view. Besides, reasonable integration of multiple views can exploit the complementary advantages among them. Cao et al. [33] extracted information from different views of images and then used HSIC to ensure the independence between views, which showed good results on clustering problems.
The sparse representation-based classifier (SRC) proposed by Wright et al. [34] is one of the hot topics in machine learning research in recent years. Its hypothesis is that a test sample can be linearly represented by the training samples of its class when there are enough such training samples. So far, SRC has been widely used in face recognition [35]–[37] and other fields [38]–[40]. Yin et al. [41] and Zhang et al. [42] improved the SRC method and proposed the kernel sparse representation-based classifier (KSRC), which maps the original feature space to a kernel space and can thus solve nonlinear classification problems well.
Inspired by multiview learning, we assume that using multiple representations of DNA as multiple views can enhance the classification performance of the original single view. Besides, to improve classification performance, our model adopts KSRC, using a small number of training samples to represent the test samples nonlinearly. In this study, we propose a multiview Laplacian kernel sparse representation-based classifier (MvLapKSRC-HSIC), which is a KSRC-based classifier combined with graph regularization; to maintain the independence between views, we add a multiview regularization term constructed from the Hilbert–Schmidt independence criterion (HSIC), which was proposed by Gretton et al. [43]. Besides, the original norm is replaced by the L2,1 norm so that test samples can be predicted in batches.
Inspired by transfer learning, the cross-species experiment migrates known knowledge from the source domain to the target domain. Therefore, we use the information of known species as a training set to predict the target species. According to [44], the number of 4mC sites differs between species, but their features have potential correlations. To explore the correlated feature information among species, it is of great significance to study cross-species experiments. Here, we examine whether the multiview learning method can improve cross-species classification by fusing the data of multiple species as a training set and predicting on other species. Therefore, we conduct two experiments: one is 4mC site prediction on a single species using multiview learning, and the other is 4mC site prediction across species using multiview learning. All experimental results show that our proposed methods are superior to other outstanding methods in both single-species and cross-species experiments.
Our contributions in this work are as follows.
1) We use the kernel SRC to improve 4mC site prediction and adopt the concept of multiview learning to integrate the information of multiple views, which further enhances the prediction performance for 4mC sites.
2) The introduction of the L2,1 norm allows test samples to be predicted in batches.
3) Laplacian regularization is introduced to keep the samples smooth in the manifold space.
4) HSIC is used to maintain the independence between different views, and an effective iterative algorithm is used to optimize the model.

A. Sparse Representation-Based Classifier
For a classification problem with C classes, suppose we have training samples X = [X^1_train, ..., X^c_train, ..., X^C_train] ∈ R^{d×n} and test samples Y ∈ R^{d×m}, where d is the feature dimension, n is the number of training samples, and m is the number of test samples. SRC aims to represent each sample in the test set sparsely as a linear combination of samples from the training set:

min_a ||a||_1  s.t.  y = Xa

where y ∈ R^d is a test sample vector in Y, and a = [0, ..., a_c, ..., 0]^T ∈ R^n is the sparse coefficient vector associated only with class c. The sparse coefficient vector a can be obtained by minimizing the L0 norm or L1 norm of a [34]. Then, we assign the predicted label as

label(y) = arg min_c re_c(y),  re_c(y) = ||y − X^c_train a_c||_2

where re_c(y) is the reconstruction error of sample y in class c; the class with the smallest reconstruction error gives the predicted label.
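The decision rule above can be sketched in a few lines. For simplicity, this sketch replaces the L1/L0 sparse-coding step with a ridge-regularized least-squares surrogate, so it is an illustrative approximation of SRC rather than the exact solver used in [34]; all function and variable names are our own.

```python
import numpy as np

def src_predict(X_train, train_labels, y, lam=0.01):
    """SRC-style prediction: code y over all training samples, then pick the
    class whose coefficients reconstruct y with the smallest error.
    NOTE: the coding step is a ridge surrogate for the L1/L0 problem."""
    n = X_train.shape[1]
    # a ~ argmin ||y - X a||^2 + lam ||a||^2  (surrogate for the sparse coding)
    a = np.linalg.solve(X_train.T @ X_train + lam * np.eye(n), X_train.T @ y)
    classes = np.unique(train_labels)
    # re_c(y) = ||y - X a_c||_2, keeping only the class-c coefficients
    errors = [np.linalg.norm(y - X_train @ np.where(train_labels == c, a, 0.0))
              for c in classes]
    return classes[int(np.argmin(errors))]
```

Columns of `X_train` are training samples (d x n, matching the notation above), and `train_labels` holds their class indices.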

B. Kernel Sparse Representation-Based Classifier
KSRC solves the nonlinear classification problem by mapping the original feature space to an appropriate feature space [41], [42]. First, we map the feature space X to a feature space F by a function ϕ : X → F, and the objective becomes

f(a) = ||ϕ(y) − ϕ(X)a||_2^2 + λ||a||_1   (3)

which KSRC minimizes over a.

Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
The first term on the right-hand side of (3) can be rewritten via the kernel trick as

||ϕ(y) − ϕ(X)a||_2^2 = k(y, y) − 2k(y, X)a + a^T K(X, X)a

where k(·, ·) denotes the kernel function and K(X, X) is the kernel matrix of the training samples. Here, the RBF kernel can be used:

k(x_i, x_j) = exp(−γ||x_i − x_j||_2^2)

where the parameter γ denotes the bandwidth.
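The kernel-trick expansion above can be checked numerically as follows (samples stored as columns; function names are our own):

```python
import numpy as np

def rbf_kernel(X, Y, gamma=0.5):
    """K[i, j] = exp(-gamma * ||x_i - y_j||^2), with samples as columns."""
    sq = (np.sum(X**2, axis=0)[:, None] + np.sum(Y**2, axis=0)[None, :]
          - 2.0 * X.T @ Y)
    return np.exp(-gamma * np.maximum(sq, 0.0))

def kernel_residual_sq(X, y, a, gamma=0.5):
    """||phi(y) - phi(X) a||^2 = k(y,y) - 2 k(y,X) a + a^T K(X,X) a."""
    Kxx = rbf_kernel(X, X, gamma)
    kxy = rbf_kernel(X, y[:, None], gamma)[:, 0]
    kyy = 1.0  # RBF kernel of a point with itself
    return float(kyy - 2.0 * kxy @ a + a @ Kxx @ a)
```

With a = 0 the residual reduces to k(y, y) = 1, which gives a quick sanity check.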

III. METHODOLOGY
A. Feature Representation
1) Position-Specific Trinucleotide Propensity (PSTNP): PSTNP can extract the location information of a DNA sequence. Assuming a DNA sequence of length L, PSTNP first extracts the 3-mer frequency matrices of the positive and negative samples, of size 64 × (L − 2), denoted Z+ and Z−. Then, the two matrices are subtracted to obtain the PSTNP profile F, which can be used to encode each DNA sequence as a vector of length L − 2 [45]. After that, the vectors of the whole original DNA sequence set are stacked to obtain the feature matrix. In the cross-species 4mC prediction problem, multiview PSTNP uses the 3-mer frequency matrices of different species to encode multiple views of features.
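A minimal sketch of the PSTNP profile and encoding described above (function and variable names are our own):

```python
import numpy as np
from itertools import product

# index the 64 trinucleotides
TRINUC = {''.join(p): i for i, p in enumerate(product('ACGT', repeat=3))}

def pstnp_profile(pos_seqs, neg_seqs):
    """F = Z+ - Z-, where Z is the 64 x (L-2) position-specific 3-mer
    frequency matrix of a sample set."""
    L = len(pos_seqs[0])

    def freq(seqs):
        Z = np.zeros((64, L - 2))
        for s in seqs:
            for i in range(L - 2):
                Z[TRINUC[s[i:i + 3]], i] += 1.0
        return Z / len(seqs)

    return freq(pos_seqs) - freq(neg_seqs)

def pstnp_encode(seq, F):
    """Encode one sequence as a length-(L-2) vector of PSTNP scores."""
    return np.array([F[TRINUC[seq[i:i + 3]], i] for i in range(len(seq) - 2)])
```

For the 41-nt sequences used in this paper, each sequence becomes a 39-dimensional vector.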
2) Nucleotide Chemical Property (NCP): The nucleotides constituting a DNA sequence are of four types, namely adenine (A), guanine (G), thymine (T), and cytosine (C). Based on the fact that nucleotides with different chemical properties have different structures [46], NCP classifies the four nucleotides according to ring structure, secondary structure formation, and chemical functionality [21]. Let I_r, I_s, and I_c denote the indicators of ring structure, secondary structure formation, and chemical functionality, respectively. I_r = 1 means the nucleotide has two rings {A, G}, and I_r = 0 means the nucleotide has one ring {C, T}. The value of I_s is determined by the hydrogen bond strength: for the nucleotides {A, T}, I_s is set to 1; for {C, G}, I_s is set to 0. I_c is based on functional groups: when the nucleotide carries an amino group {A, C}, I_c is 1; when it carries a keto group {G, T}, I_c is 0. Thus A = (1, 1, 1), C = (0, 0, 1), G = (1, 0, 0), and T = (0, 1, 0). NCP constructs a 3-D vector (I_r, I_s, I_c) for each nucleotide, so a DNA sequence of length L yields a feature of dimension 3 × L.
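The NCP encoding above is a direct table lookup; a sketch using the indicator sets stated in the text:

```python
# Indicator sets from the text: two rings {A, G}, I_s = 1 for {A, T},
# amino group {A, C}.
RING, HBOND, AMINO = set('AG'), set('AT'), set('AC')

def ncp_encode(seq):
    """Map each nucleotide to (I_r, I_s, I_c); a length-L sequence -> 3*L vector."""
    vec = []
    for nt in seq:
        vec += [int(nt in RING), int(nt in HBOND), int(nt in AMINO)]
    return vec
```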
3) DNA Physicochemical Properties (DPP): DPP is a feature vector constructed from dinucleotides. First, each DNA sequence of length L is scanned with a sliding window of length 2 (step 1) to obtain its L − 1 dinucleotides. Then, each dinucleotide is encoded by the normalized values of 12 physicochemical properties, so that each strand of DNA obtains a (L − 1) × 12-dimensional feature vector [47].
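A shape-only sketch of the DPP encoding; the property table below is a random placeholder, since the real encoding uses the normalized values of 12 published physicochemical properties per dinucleotide [47]:

```python
import numpy as np

DINUCS = [a + b for a in 'ACGT' for b in 'ACGT']
# Placeholder table: random values used purely to illustrate the feature
# shape; substitute the 12 normalized physicochemical properties from [47].
rng = np.random.default_rng(0)
PROPS = {d: rng.random(12) for d in DINUCS}

def dpp_encode(seq):
    """Slide a length-2 window (step 1) over the sequence and stack the
    12 property values of each of the L-1 dinucleotides."""
    return np.concatenate([PROPS[seq[i:i + 2]] for i in range(len(seq) - 1)])
```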

B. Multiview Learning Model
In this section, we propose the MvLapKSRC-HSIC method by combining the multiview learning method and KSRC with a Laplacian regularization term. An overview is shown in Fig. 1.
1) Laplacian KSRC (LapKSRC) With L2,1 Norm: Based on the KSRC method, a Laplacian regularization term [48], [49] is added to better preserve the associations between DNA sequences. Besides, we use the L2,1 norm instead of the L1 or L0 norm so that all test samples can be solved together. The formula is as follows:

min_A ||ϕ(Y) − ϕ(X)A||_F^2 + λ||A||_{2,1} + (μ/2) Σ_{i,j} W_ij ||a_i − a_j||_2^2   (7)

where the third term of (7) is the Laplacian regularization term, W_ij is the similarity between samples i and j, and A = [a_1, ..., a_m] is the sparse matrix of all test samples. As in a previous study [50], the third term can also be represented as μ tr(A^T LA), where L = D − W is the Laplacian matrix and D is the diagonal degree matrix with D_ii = Σ_j W_ij.

2) Hilbert–Schmidt Independence Criterion: HSIC is a measurement used to determine whether variables in two different domains are independent [33], [43]. Given n sample pairs, the empirical HSIC is

HSIC(U, V) = (n − 1)^{-2} tr(Ω_u H Ω_v H)

where Ω_u and Ω_v are kernel matrices, H = I − (1/n)ee^T is the centering matrix, I is the identity matrix, and e is the all-ones vector of dimension n.
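The equivalence between the pairwise smoothness penalty and its trace form tr(A^T LA) can be checked numerically; in this sketch (names are our own) the coefficient vectors are taken as the rows of A:

```python
import numpy as np

def graph_laplacian(W):
    """L = D - W for a symmetric similarity matrix W."""
    return np.diag(W.sum(axis=1)) - W

def smoothness_pairwise(A, W):
    """(1/2) * sum_ij W_ij ||a_i - a_j||^2 over the rows a_i of A."""
    n = A.shape[0]
    return 0.5 * sum(W[i, j] * np.sum((A[i] - A[j]) ** 2)
                     for i in range(n) for j in range(n))

def smoothness_trace(A, W):
    """tr(A^T L A), the compact form of the same regularizer."""
    return float(np.trace(A.T @ graph_laplacian(W) @ A))
```

Both functions return the same value for any symmetric W, which is why the objective can be optimized in the compact trace form.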
To make the current view complementary to the other views, that is, to make the sparse representation of the current view different from those of the other views, we compute the sum of HSIC terms between the sparse coefficient matrix A^v of the vth view and those of the other views A^w (w = 1, ..., V, w ≠ v), and optimize it in the objective function. Besides, we ignore the constant factor in HSIC for simplicity [33]. To sum up, the formula is as follows:

Σ_{w≠v} HSIC(A^v, A^w) = tr(Ω_v Γ),  Γ = Σ_{w=1, w≠v}^{V} H Ω_w H

where Ω_v = A^v A^{vT} and Ω_w = A^w A^{wT} denote the inner product kernels of A^v and A^w, respectively.
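The HSIC term with inner-product kernels can be sketched directly; the constant factor (n − 1)^{-2} is dropped, as in the model (function names are our own):

```python
import numpy as np

def hsic_term(A_v, A_w):
    """tr(Omega_v H Omega_w H) with inner-product kernels Omega = A A^T
    and centering matrix H = I - (1/n) e e^T."""
    n = A_v.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return float(np.trace(A_v @ A_v.T @ H @ A_w @ A_w.T @ H))
```

If one view's coefficients are constant across samples, its centered kernel vanishes and the term is zero, i.e., nothing dependent is shared between the views.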
3) Multiview LapKSRC-HSIC (MvLapKSRC-HSIC): Suppose we have V views of training samples {X^1, ..., X^v, ..., X^V} and test samples {Y^1, ..., Y^v, ..., Y^V}, where X^v ∈ R^{d_v×n}, Y^v ∈ R^{d_v×m}, v = 1, ..., V, and d_v denotes the dimension of the vth view feature vector. The objective function is given as follows:

min_{A^1,...,A^V} Σ_{v=1}^{V} [ ||ϕ(Y^v) − ϕ(X^v)A^v||_F^2 + λ||A^v||_{2,1} + μ tr(A^{vT} L^v A^v) ] + θ Σ_{v=1}^{V} Σ_{w≠v} HSIC(A^v, A^w)   (11)

where L^v denotes the Laplacian matrix of the vth view, HSIC(A^v, A^w) captures the "diversity" between the vth and wth views, and λ, μ, and θ are the regularization parameters.

4) Solving Objective Function:
To solve (11), the sparse matrix of each view is optimized in turn. Specifically, we optimize the sparse matrix A^v of the vth view while the other views are fixed. The objective function of the vth view is as follows:

min_{A^v} ||ϕ(Y^v) − ϕ(X^v)A^v||_F^2 + λ||A^v||_{2,1} + μ tr(A^{vT} L^v A^v) + θ Σ_{w≠v} HSIC(A^v, A^w).   (12)

Since (12) is a convex function, the optimal A^v can be obtained by setting the derivative of (12) to zero. For the first term of (12), the kernel trick turns the residual into kernel form, and its derivative is

∂/∂A^v ||ϕ(Y^v) − ϕ(X^v)A^v||_F^2 = 2K(X^v, X^v)A^v − 2K(X^v, Y^v).

The derivative of the second term of (12) is

∂/∂A^v λ||A^v||_{2,1} = 2λU^v A^v

where U^v is a diagonal matrix with U^v_ii = 1/(2||a^{v,i}||_2) and a^{v,i} is the ith row of A^v. Next, the derivative of the third term with respect to A^v is

∂/∂A^v μ tr(A^{vT} L^v A^v) = 2μL^v A^v.

Finally, the derivative of the HSIC term of the vth view is

∂/∂A^v θ tr(Ω_v Γ) = 2θΓA^v.

Algorithm 1: Algorithm of MvLapKSRC-HSIC.
Require: V views of training samples {X^1, ..., X^v, ..., X^V} and test samples {Y^1, ..., Y^v, ..., Y^V}; parameters γ, λ, μ, θ.
Ensure: Predicted labels of the test samples.
1: Compute the kernel matrices K(X^v, X^v) and K(X^v, Y^v) and the Laplacian matrices L^v; initialize A^v for each view.
2: repeat
3:   for v = 1 to V do
4:     Update U^v and Γ with the current A^1, ..., A^V;
5:     Update A^v by the closed-form solution below;
6:   end for
7: until convergence

Setting the sum of the above derivatives to zero, we obtain the update formula of A^v:

A^v = (K(X^v, X^v) + λU^v + μL^v + θΓ)^{-1} K(X^v, Y^v).

We use the optimized sparse matrices A^1, ..., A^V to predict the labels of the test samples y_j, j = 1, ..., m. Specifically, each A^v is used to calculate the reconstruction error of every test sample with respect to the training samples of each class under that view. For each label c = 1, ..., C, the class with the minimum average reconstruction error over all views is selected as the final prediction:

label(y_j) = arg min_c (1/V) Σ_{v=1}^{V} re_c^v(y_j).

The process of MvLapKSRC-HSIC is listed in Algorithm 1.
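One pass of the per-view closed-form update can be sketched as follows; this is an illustrative re-implementation of the update rule derived above, not the authors' released code, and the helper names are our own:

```python
import numpy as np

def l21_reweight(A, eps=1e-8):
    """Diagonal matrix U with U_ii = 1 / (2 ||a_i||_2), the standard
    reweighting arising from differentiating the L2,1 norm (eps avoids
    division by zero)."""
    return np.diag(1.0 / (2.0 * np.linalg.norm(A, axis=1) + eps))

def update_view(K_xx, K_xy, Lap, Gamma, A_v, lam, mu, theta):
    """Closed-form update for one view:
    A^v = (K(X,X) + lam*U + mu*L + theta*Gamma)^{-1} K(X,Y)."""
    n = K_xx.shape[0]
    U = l21_reweight(A_v)
    M = K_xx + lam * U + mu * Lap + theta * Gamma
    # small ridge for numerical safety
    return np.linalg.solve(M + 1e-10 * np.eye(n), K_xy)
```

With all regularization weights set to zero and an identity kernel matrix, the update reduces to A^v = K(X, Y), which gives a simple sanity check on the formula.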

IV. RESULTS
In this section, we first introduce the datasets used in our experiments. After optimizing the parameters, we carry out the following ablation experiments: first, views are constructed from the DNA features of a single species to evaluate the model under a single view and under multiple views; second, the features of multiple species, extracted by the PSTNP feature construction method, are integrated to construct multiview data and tested on the datasets of other species. Finally, existing outstanding methods are compared with our multiview learning model.

A. Datasets
From Chen et al.'s [21] work, six species datasets are introduced: Arabidopsis thaliana (A. thaliana, an angiosperm), Caenorhabditis elegans (C. elegans, a worm), Drosophila melanogaster (D. melanogaster, a fruit fly), Escherichia coli (E. coli, a prokaryote), Geobacter pickeringii (G. pickeringii, a prokaryote), and Geoalkalibacter subterraneus (G. subterraneus, a prokaryote), all extracted from MethSMRT [51]. The sequence length for each species is 41 nt. As in [21], sequences with a modification QV score greater than or equal to 30 were obtained via the methylome analysis technical note, and then the CD-HIT method [52] was used to remove sequences with similarity higher than 80%. The number of samples for each of the six species is given in Table I.

B. Evaluation Measurements
For evaluating different models, four evaluation metrics are used: accuracy (ACC) [53], Matthews correlation coefficient (MCC), sensitivity (SN) [54], and specificity (SP) [54]:

SN = N_TP / (N_TP + N_FN)
SP = N_TN / (N_TN + N_FP)
ACC = (N_TP + N_TN) / N_total
MCC = (N_TP · N_TN − N_FP · N_FN) / sqrt((N_TP + N_FP)(N_TP + N_FN)(N_TN + N_FP)(N_TN + N_FN))

where N_TP denotes the number of true positive samples, N_TN the number of true negative samples, N_FP the number of false positive samples, N_FN the number of false negative samples, and N_total = N_TP + N_TN + N_FN + N_FP the total number of samples. In addition, tenfold cross-validation is used to compute performance, which not only yields reliable results but also makes comparison with other models fair.
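The four metrics above can be computed from the confusion-matrix counts as follows:

```python
import math

def evaluation_metrics(n_tp, n_tn, n_fp, n_fn):
    """ACC, MCC, SN, SP from the confusion-matrix counts defined above."""
    total = n_tp + n_tn + n_fp + n_fn
    acc = (n_tp + n_tn) / total
    sn = n_tp / (n_tp + n_fn)
    sp = n_tn / (n_tn + n_fp)
    denom = math.sqrt((n_tp + n_fp) * (n_tp + n_fn)
                      * (n_tn + n_fp) * (n_tn + n_fn))
    mcc = (n_tp * n_tn - n_fp * n_fn) / denom if denom else 0.0
    return acc, mcc, sn, sp
```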

C. Selecting Optimal Parameters
As for parameter selection, our model has multiple parameters: the RBF kernel bandwidth γ and the regularization parameters λ, μ, and θ. The grid search method is used to optimize them over the range {10^-4, 10^-3, 10^-2, 10^-1, 5 × 10^-1, 1}; the selected values are given in Table II. In addition, we assess how the loss changes with the number of iterations.

TABLE II PARAMETERS OF DIFFERENT VIEWS VIA GRID SEARCH METHOD ON MVLAPKSRC-HSIC
No HSIC regularization term for a single view.

As shown in Fig. 2, the loss tends to stabilize around iteration 2. We use the difference between the current and previous iteration errors to determine the number of iterations. When the number of iterations is 5, the difference is negligible (about 10^-14), so we set the number of iterations to 5.

D. Performance of Single-View and Multiview Features
We first extract the PSTNP, NCP, and DPP features and then divide them into two views according to the source of information: one related to DNA location information and the other related to DNA physicochemical information. In the single-view setting, PSTNP, NCP, and DPP are each taken as input. In the multiview setting, PSTNP is taken as the first view and the superposition of the NCP and DPP features as the second view, constructing the multiview input of MvLapKSRC-HSIC. The performance of the different views is listed in Table III. As can be seen from Table III, the multiview feature (NCP-DPP & PSTNP) is not only higher than the single features (PSTNP, NCP, and DPP) in ACC, MCC, and SN, but also higher than the feature combinations (NCP-DPP and NCP-DPP-PSTNP). On the E. coli dataset, its ACC, MCC, and SN are 0.0013, 0.0017, and 0.0015 higher than those of PSTNP; its ACC and MCC are 0.0811 and 0.1961 higher than the feature combination (NCP-DPP-PSTNP); and its SN and SP are 0.0328 and 0.1253 higher than NCP-DPP-PSTNP, respectively. On the G. pickeringi dataset, its MCC and SP are 0.0096 and 0.010 higher than those of PSTNP, respectively; its ACC, MCC, and SN are 0.0387, 0.0459, and 0.0795 higher than those of NCP-DPP-PSTNP, respectively; only its SP is 0.0021 lower than NCP-DPP-PSTNP.

E. Comparison With Existing Methods
We compare our proposed method with other existing models in this section, as given in Table IV, where bold numbers indicate that the current method is better than the others. Our proposed model has the highest ACC (0.819), MCC (0.627), and SN (0.833) on A. thaliana, which are 0.016, 0.007, and 0.13 higher than the others, respectively. In addition, its SP ranks second, only after DeepTorrent. On C. elegans, our model performs best in ACC and SN, which are 0.861 and 0.871, respectively, 0.003 and 0.061 higher than DeepTorrent, while its MCC (0.718) and SP (0.853) rank second. As for D. melanogaster, our model ranks first in ACC, MCC, and SN, which are 0.012, 0.012, and 0.05 higher than the others. On E. coli, our model is superior to the other methods on all four evaluation metrics: its ACC is 0.954, 0.081 higher than DeepTorrent; its MCC is 0.911, 0.164 higher than DeepTorrent; and its
SN and SP are 0.064 and 0.096 higher, respectively. For the G. pickeringi and G. subterraneus datasets, our model ranks first in ACC, MCC, and SN, being 0.011, 0.019, and 0.073 higher on G. pickeringi and 0.012, 0.021, and 0.077 higher on G. subterraneus. In summary, compared with other existing models, our proposed model has better prediction ability for 4mC sites.

F. Analysis on Cross Species
To better reflect the prediction ability of the proposed model, we carry out the following cross-species 4mC site prediction experiments. Specifically, one species is used as the training data and each of the other species is used as the test data [27]; we call this the single-view cross-species experiment. Furthermore, to reflect the superiority of multiple views in cross-species 4mC site prediction, we construct multiview cross-species features as follows: for the current training species, we use the datasets of multiple species (excluding the test species but including the training species) to construct multiple PSTNP profiles, and the DNA sequences of the training species are then encoded by these PSTNP profiles to form the multiview features. The same PSTNP profiles are also used to build the test species features.
We analyze the cross-species experiments based on a single view and multiple views, as well as the methods DeepTorrent [27], iDNA4mC [21], 4mCPred [44], and 4mcPred-SVM [24], as shown in Tables V–X. Bold numbers mean that the current cross-species result is the best among the compared methods. In the comparison between the single-view and multiview experiments of our model (see Tables V and VI), the multiview features give better cross-species prediction. In addition, our multiview model has good predictive ability in the cross-species prediction experiments on A. thaliana, E. coli, D. melanogaster, G. pickeringi, and G. subterraneus. Especially when predicting D. melanogaster, four of the five other species used separately as training sets rank first. Although the DeepTorrent method also shows good results on all six species, especially fair ACC values on C. elegans, our multiview model's ACC values are not far behind those of DeepTorrent. We can see that the multiview method performs well in cross-species prediction. Therefore, we find that different species share, to some extent, the same underlying representation; the multiview method can learn this representation and thus has great potential in cross-species prediction. The results of the multiview cross-species experiment are shown in Fig. 3, where each vertex of the hexagon represents a test species and each colored line represents the performance of a training set on each test species.

V. CONCLUSION
To predict 4mC sites more effectively, we propose a novel machine learning method: a Laplacian KSRC algorithm based on multiview learning (MvLapKSRC-HSIC). To construct the multiview input data matrices, we use the three feature extraction methods PSTNP, NCP, and DPP to obtain the multiview feature representation of DNA sequences and then train the MvLapKSRC-HSIC model to predict new DNA sequences. Compared with other existing models in single-species experiments, our model shows good prediction ability, with the highest accuracy on all six species. In the cross-species experiments, our model combines the DNA sequence information of different species and also predicts 4mC sites well across species. Our model effectively improves the classification of DNA 4mC sites, but the current model only considers the fusion of PSTNP, NCP, and DPP features, which yields two views. In future research, we will focus on fusing features to build more than two views. Besides, we can also approach multiview fusion from the perspective of how to select the features to fuse.

Fig. 2. Loss versus the number of iterations on the six datasets.

Fig. 3. Radar map of the cross-species experiments for the multiview version of our proposed model (ACC%).
n and m are the numbers of train and test samples, and d_v is the dimension of the feature in the vth view.

TABLE III PERFORMANCE OF DIFFERENT FEATURES BY OUR PROPOSED METHOD ON E. COLI AND G. PICKERINGI
Bold numbers indicate that the current feature or feature-fusing method is better than the others. The sign "-" refers to feature combination; that is, "NCP-DPP" means that the features of NCP and DPP are combined as one view, and likewise for "NCP-DPP-PSTNP". The sign "&" represents view fusion; that is, "NCP-DPP & PSTNP" means that the feature combined from NCP and DPP and the feature PSTNP are fused as two views.

TABLE IV COMPARISON WITH OTHER EXISTING METHODS ON SIX DATASETS
a Results are derived from [21]. b Results are derived from [44]. c Results are derived from [24]. d Results are derived from [26]. e Results are derived from [27].

TABLE V CROSS-SPECIES PREDICTION RESULTS OF OUR METHOD (MULTIVIEW) IN TERMS OF ACC (%)

TABLE VI CROSS-SPECIES PREDICTION RESULTS OF OUR METHOD (SINGLE VIEW) IN TERMS OF ACC (%)

TABLE VII CROSS-SPECIES PREDICTION RESULTS OF DEEPTORRENT IN TERMS OF ACC (%)

TABLE VIII CROSS-SPECIES PREDICTION RESULTS OF IDNA4MC IN TERMS OF ACC (%)

TABLE IX CROSS-SPECIES PREDICTION RESULTS OF 4MCPRED IN TERMS OF ACC (%)

TABLE X CROSS-SPECIES PREDICTION RESULTS OF 4MCPRED-SVM IN TERMS OF ACC (%)