Overlapping Clusters and Support Vector Machines Based Interval Type-2 Fuzzy System for the Prediction of Peptide Binding Affinity

In the post-genome era, biological datasets that are high dimensional, nonlinear, and contain few instances are becoming increasingly difficult to process. This paper addresses these characteristics because they adversely affect the performance of predictive models in bioinformatics. An interval type-2 Takagi Sugeno fuzzy predictive model is proposed to manage the high dimensionality and nonlinearity that are common in such datasets. A new clustering framework, based on overlapping regions between the clusters, is proposed to simplify antecedent operations for an interval type-2 fuzzy system. The cluster analysis of partitions and the statistical information derived from them identify the upper and lower membership functions forming the premise part. This is further enhanced by adapting the regression version of support vector machines in the consequent part. The proposed method is used in experiments to quantitatively predict the affinities of peptide bindings to biomolecules. This case study poses a challenge in post-genome studies and remains an open problem due to the complexity of the biological system, the diversity of peptides, and the curse of dimensionality of the amino acid index representation characterizing the peptides. On four different peptide binding affinity datasets, the proposed method showed better generalization ability on all of them, improving prediction accuracy by up to 58.2% on unseen peptides in comparison with the predictive methods presented in the literature. Source code of the algorithm is available at https://github.com/sekerbigdatalab.


I. INTRODUCTION
Peptides, small sequences of amino acids, often interact with proteins in cellular processes [1]. One of the important peptide-protein interactions occurs when a peptide binds to a Major Histocompatibility Complex (MHC), forming a peptide-MHC (pMHC) complex. The pMHC is transported to the cell membrane, where it is recognized by a T-cell in order to induce an immune response. Therefore, in pharmaceutical studies, validation of a pMHC binding with the drug of interest is crucial. However, this is a complicated process, and computational methods are constantly being developed to support traditional empirical research to identify the most likely candidates out of a library of thousands of peptides. Moreover, predictive models based on sequence-based methods are needed to approximate the binding affinities.
The associate editor coordinating the review of this manuscript and approving it for publication was Chee Keong Kwoh.
In recent years, the problem of binding affinity prediction has become two-fold. Qualitative studies classify binding predictions as 'binders' and 'non-binders' [2] or 'weak' and 'strong' binders [3]-[5], whereas quantitative studies allow real-valued binding predictions [6]. Lately, regression-based approaches have become more prevalent in sequence-based studies. A number of methods are used as predictors, such as partial least squares [7], random forests [8], support vector regression [9] and regularization methods [10]. Nevertheless, the complexity of a biological system, the diversity of peptides, and the curse of dimensionality of the amino acid index representation that characterises the peptides have adverse effects on the performance of peptide-binding predictive models. Moreover, uncertainties are prevalent in peptide binding affinity datasets due to imprecise or noisy measurements, and these datasets need to be analysed appropriately [11]. There is still a lack of methods accounting for this aspect of peptide-protein bindings [12].
In certain applications, where the data is complex and non-linear, fuzzy systems are more tolerant of imprecise information and capable of modelling linguistic and numerical uncertainty. Moreover, they form a rule-based structure similar to human reasoning. Presently, type-2 fuzzy systems [13] are more widely used in real-world applications than ever before [14]. In certain applications, they perform better than type-1 fuzzy systems in terms of modelling and minimizing uncertainties [15]-[17]. Type-2 fuzzy systems are preferred because they treat the membership functions themselves as imprecise and can therefore cope with the uncertainties associated with them.
In this paper, an overlapping clusters and support vector machine based interval type-2 Takagi Sugeno fuzzy system is proposed to address the aforementioned shortcomings of the sequence-based predictive models. A novel clustering framework is proposed in order to simplify antecedent operations for an interval type-2 fuzzy system. This clustering framework is based on overlapping regions between the clusters. The cluster analysis of partitions and the statistical information derived from them identify the upper and lower membership functions forming the premise part. This is further enhanced by adapting the regression version of support vector machines (SVR) in the consequent part [18]. The computational demand of the defuzzification process is addressed by a method that has a closed-form representation. In addition, feature selection is used to reduce the high number of amino acid biochemical descriptors that represent a peptide and form the input scheme of the learning model. The prediction results indicate that the proposed model not only minimized the effects of uncertain continuous peptide binding affinities but also provided high precision in unravelling the binding affinities of unobserved peptides.
The remainder of the study starts by introducing the materials and methods (Section II). This section describes the identification of the SVR based interval type-2 fuzzy system with the overlapping clusters concept. Section III shows the results of the case study along with the discussion. Finally, concluding remarks are given in Section IV.

II. MATERIALS AND METHODS
A. SUPPORT VECTOR-BASED INTERVAL TYPE-2 FUZZY SYSTEM
Type-2 fuzzy sets are defined through membership functions that are themselves fuzzy. However, computations with type-2 fuzzy sets are complex, and Interval Type-2 (IT2) fuzzy sets can be used to ease them [19]. The Takagi Sugeno model is one of the most widely used fuzzy systems [20]. This model structure designs the consequent parameters using a linear function. The rule base of the interval type-2 Takagi Sugeno fuzzy system with r rules can be expressed as:
$$R^i:\ \text{IF } x_1 \text{ is } \tilde{A}^i_1 \text{ and } \ldots \text{ and } x_n \text{ is } \tilde{A}^i_n \text{ THEN } y_i = c_0 + c_1 x_1 + c_2 x_2 + \cdots + c_n x_n, \quad i = 1, \ldots, r \tag{1}$$
where $x_1, x_2, \ldots, x_n$ represent the input vector and $c_0, c_1, c_2, \ldots, c_n$ are the regression coefficients; the IT2 fuzzy set is denoted by $\tilde{A}^i_n$ for the variable n and rule i; and $y_i$ is the rule output. Type-2 fuzzy sets should be placed in the premise or consequent part (or both) in order to define a type-2 fuzzy system. IT2 fuzzy sets are characterized by upper membership functions (UMFs) and lower membership functions (LMFs). This is how the uncertainty is modeled for the IT2 membership function. The bounded region between the UMF and the LMF is the footprint of uncertainty (FOU). The secondary membership grade of an interval type-2 fuzzy set is unity everywhere within the footprint of uncertainty. A three-dimensional representation of an interval type-2 fuzzy set is depicted in Fig. 1.
The firing strengths of the interval type-2 fuzzy system are determined by using the t-norm operator and can be calculated as:
$$\underline{f}^i = \underline{\mu}_{\tilde{A}^i_1}(x_1) * \underline{\mu}_{\tilde{A}^i_2}(x_2) * \cdots * \underline{\mu}_{\tilde{A}^i_n}(x_n) \tag{2}$$
$$\overline{f}^i = \overline{\mu}_{\tilde{A}^i_1}(x_1) * \overline{\mu}_{\tilde{A}^i_2}(x_2) * \cdots * \overline{\mu}_{\tilde{A}^i_n}(x_n) \tag{3}$$
where $\underline{\mu}_{\tilde{A}^i_k}(x_k)$ ($\overline{\mu}_{\tilde{A}^i_k}(x_k)$) is the lower (upper) membership degree for input variable $x_k$, and $*$ denotes the product t-norm operation. The output of an IT2 fuzzy system is obtained through type-reduction and defuzzification. The Karnik-Mendel algorithm is the most widely used type-reduction method; it computes the left and right end points required for the IT2 fuzzy set [21]. These end points are then defuzzified to get the final output. Karnik-Mendel is an iterative algorithm and is computationally time-intensive. Therefore, alternative approaches have been presented in the literature [22]-[24]. The proposed IT2 fuzzy system implements the Biglarbegian-Melek-Mendel (BMM) method [25], which has the closed mathematical form
$$y = q\,\frac{\sum_{i=1}^{r} \overline{f}^i y_i}{\sum_{i=1}^{r} \overline{f}^i} + p\,\frac{\sum_{i=1}^{r} \underline{f}^i y_i}{\sum_{i=1}^{r} \underline{f}^i} \tag{4}$$
where q and p are the parameters used to design the upper and lower weighted average of the rule consequents, respectively.
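The firing-strength and BMM computations described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names and the numeric membership degrees are made up for the example, and the closed form assumes, following the order in which the text names them, that q weights the upper weighted average and p the lower one.

```python
from math import prod


def firing_interval(lower_degrees, upper_degrees):
    """Product t-norm over the antecedents of one rule -> (lower, upper) firing strength."""
    return prod(lower_degrees), prod(upper_degrees)


def bmm_output(lower_f, upper_f, rule_outputs, q, p):
    """Closed-form BMM output (assumed form): q * upper average + p * lower average."""
    upper_avg = sum(f * y for f, y in zip(upper_f, rule_outputs)) / sum(upper_f)
    lower_avg = sum(f * y for f, y in zip(lower_f, rule_outputs)) / sum(lower_f)
    return q * upper_avg + p * lower_avg


# Two rules, two inputs; the membership degrees are illustrative only.
rule1 = firing_interval([0.5, 0.4], [0.8, 0.5])
rule2 = firing_interval([0.2, 0.5], [0.6, 1.0])
lower_f = [rule1[0], rule2[0]]
upper_f = [rule1[1], rule2[1]]

y = bmm_output(lower_f, upper_f, rule_outputs=[1.0, 3.0], q=0.5, p=0.5)
```

No iteration is needed: once the firing intervals are known, the output is a single weighted sum, which is the practical advantage of a closed form over the iterative Karnik-Mendel procedure.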
Recently, support vector machines have been incorporated into interval type-2 fuzzy systems to identify the parameters of the consequent part [26], [27]. The regression coefficients (w and b) that parameterize the linear SVR are obtained from the training samples. To incorporate SVR into the interval type-2 fuzzy system, the input for each data item as in (5) is transformed to (6). The coefficients of the rule consequents (w) and b are computed using the linear SVR. For this purpose, a library for support vector machines was used [28]. Then, the output of the support vector-based interval type-2 fuzzy system (y) is obtained from (7) and (8).
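One way to see why a linear regressor can learn the rule consequents is that the BMM output is linear in the consequent coefficients once the firing strengths are fixed. The sketch below illustrates this idea only; the exact transformation of (5) into (6) is not reproduced here, so the feature layout, the per-rule coefficients (including a per-rule bias) and all function names are assumptions made for the example.

```python
def rule_weights(lower_f, upper_f, q, p):
    """Per-rule weight implied by a BMM-style closed form (assumed)."""
    sum_lower, sum_upper = sum(lower_f), sum(upper_f)
    return [q * u / sum_upper + p * l / sum_lower
            for l, u in zip(lower_f, upper_f)]


def transform_input(x, lower_f, upper_f, q, p):
    """Map one sample to r * (n + 1) features: weight_i * [1, x_1, ..., x_n] per rule."""
    features = []
    for w in rule_weights(lower_f, upper_f, q, p):
        features.append(w)                   # multiplies the bias term of this rule
        features.extend(w * xk for xk in x)  # multiplies the slope terms of this rule
    return features


x = [0.2, 0.7]
lower_f, upper_f = [0.2, 0.1], [0.4, 0.6]
q, p = 0.5, 0.5
# Hypothetical coefficients: [c0, c1, c2] for rule 1 followed by rule 2.
coefficients = [1.0, 2.0, -1.0, 0.5, 0.0, 3.0]

# Output as a plain dot product in the transformed space ...
phi = transform_input(x, lower_f, upper_f, q, p)
y_linear = sum(f * c for f, c in zip(phi, coefficients))

# ... equals the output computed rule by rule and then combined.
rule_outputs = [1.0 + 2.0 * x[0] - 1.0 * x[1], 0.5 + 0.0 * x[0] + 3.0 * x[1]]
y_direct = (q * sum(u * y for u, y in zip(upper_f, rule_outputs)) / sum(upper_f)
            + p * sum(l * y for l, y in zip(lower_f, rule_outputs)) / sum(lower_f))
```

Under these assumptions, fitting a linear SVR on the transformed vectors `phi` would yield the stacked consequent coefficients directly.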

B. IDENTIFICATION OF INTERVAL TYPE-2 FUZZY SETS WITH OVERLAPPING CLUSTERING CONCEPT
This section introduces a novel method based on the overlapping clusters concept to initialise the interval type-2 membership function parameters. The FOU of an interval type-2 fuzzy set can be defined by varying either the mean (see Fig. 3) or the standard deviation (see Fig. 4) of the Gaussian membership function. As the overlapping regions between the clusters are applicable to the latter approach, the footprint of uncertainty is formed with a fixed mean and blurred standard deviations. Once the interval $[\sigma_1, \sigma_2]$ is determined, the upper and lower Gaussian membership functions are obtained as follows:
$$\overline{\mu}(x) = \exp\left(-\frac{(x - m)^2}{2\sigma_2^2}\right) \tag{9}$$
$$\underline{\mu}(x) = \exp\left(-\frac{(x - m)^2}{2\sigma_1^2}\right) \tag{10}$$
where m is the fixed mean and $\sigma_1 \le \sigma_2$.
The issues that need to be considered during system identification for a fuzzy system using clustering can be found in [29]. We considered finding interval type-2 membership function parameters with soft clustering methods (e.g., fuzzy c-means clustering [30]) and crisp clustering methods (e.g., hard c-means clustering [31], hierarchical cluster analysis [32]). Statistical characteristics of the clusters are used to identify the membership functions. It is assumed that the statistical information characterising a crisp cluster carries more knowledge for identifying an interval type-2 membership function than an arbitrary initialisation.
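A Gaussian interval type-2 membership function with a fixed mean and an uncertain standard deviation can be sketched as below. The sketch assumes the usual convention that the larger standard deviation gives the upper membership function and the smaller one the lower membership function; the function names and numeric values are illustrative only.

```python
from math import exp


def gaussian(x, mean, sigma):
    """Type-1 Gaussian membership degree."""
    return exp(-0.5 * ((x - mean) / sigma) ** 2)


def it2_gaussian(x, mean, sigma1, sigma2):
    """Return (lower, upper) membership degrees for a fixed mean and sigma in [sigma1, sigma2]."""
    narrow, wide = sorted((sigma1, sigma2))
    return gaussian(x, mean, narrow), gaussian(x, mean, wide)


# Degrees at x = 0.7 for a set centred at 0.5 with sigma in [0.1, 0.2].
lower, upper = it2_gaussian(0.7, mean=0.5, sigma1=0.1, sigma2=0.2)

# At the mean, both functions reach 1, so the footprint of uncertainty vanishes there.
lower_at_mean, upper_at_mean = it2_gaussian(0.5, mean=0.5, sigma1=0.1, sigma2=0.2)
```

The gap between `upper` and `lower` at a given x is the height of the footprint of uncertainty at that point; it grows with the width of the interval between the two standard deviations.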
After the cluster analysis was performed, we used the left and right end points and the centre of each cluster to define its triangular membership function. Algorithm 1 outlines the steps for finding the end points and the centre of the upper and lower membership functions using the overlapping clusters concept. The proposed overlapping clusters method derives the lower membership function from the provided upper membership function approach [33], [34]. Fig. 5 illustrates how the interval type-2 fuzzy sets are formed based on the overlapping clusters in a single input-single output scheme.

C. PEPTIDE BINDING AFFINITY DATASETS
A peptide is a sequence of amino acids approximately 10 residues long [35]. Peptide fragments bind to MHC class proteins as a cellular event. pMHC complexes are translocated to the membrane of the host cell, where they meet T-cells. When receptors of the T-cell recognize pMHC complexes, they elicit an immune response. These immune activities range from cytotoxic killing to phagocytosis of the infected cell. One main difficulty for experimental peptide studies is that the number of possible peptides that can bind to a particular MHC class molecule is extraordinarily large (≥ 500 billion) [36]. However, understanding how peptide-MHC class molecule interactions work and finding their binding affinities are crucial for health studies.
The proposed approach has been tested using peptide datasets obtained from various papers [37]-[40]. Each peptide dataset has been considered as a task and organized into training and test datasets [10]. For Tasks III and IV, two separate testing datasets were used even though the training dataset remained the same. Table 1 lists the characteristics of the peptide binding affinity tasks. Tasks I, III and IV consist of nona-peptides whereas Task II consists of octa-peptides. Table 2 gives the statistics of the peptide binding affinity tasks. Sequence logo (position specific amino acid frequency) representations of the peptide datasets are shown in Fig. 6. Amino acid feature databases such as the AAindex [41] and CISAPS [42] contain many physico-chemical and bio-chemical attributes of amino acids. Each descriptor in an amino acid feature database has twenty numerical values, one for each amino acid, along with a description. Previous studies usually use 643 descriptors, which are mostly selected from the amino acid feature databases. To be consistent, we have encoded each amino acid in a peptide with 643 descriptors as shown in Fig. 7. The total number of descriptors becomes 5144 (643×8) and 5787 (643×9) when an octa-peptide sequence and a nona-peptide sequence are encoded, respectively.
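The encoding step can be illustrated with a short sketch that concatenates one descriptor vector per residue. The descriptor table here is a placeholder filled with arbitrary numbers, not values from the AAindex or CISAPS, and the peptide strings are made up; only the resulting vector lengths correspond to the figures quoted above.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the twenty standard one-letter codes


def make_placeholder_table(n_descriptors):
    """Hypothetical descriptor table: n_descriptors arbitrary values per amino acid."""
    return {aa: [float(i + 20 * j) for j in range(n_descriptors)]
            for i, aa in enumerate(AMINO_ACIDS)}


def encode_peptide(peptide, table):
    """Concatenate the descriptor vectors of the residues, position by position."""
    encoded = []
    for residue in peptide:
        encoded.extend(table[residue])
    return encoded


table = make_placeholder_table(643)
octa = encode_peptide("ACDEFGHI", table)    # 8 residues
nona = encode_peptide("ACDEFGHIK", table)   # 9 residues
```

Because the vectors are concatenated by position, the same descriptor of the same amino acid appears as a different feature at each sequence position, which is what makes the input dimension grow with peptide length.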

III. RESULTS AND DISCUSSION
This section presents the experimental results of the overlapping clusters and support vector based interval type-2 fuzzy system on peptide binding datasets, where the real-valued affinities are predicted. The stages of the proposed interval type-2 fuzzy model are illustrated in Fig. 2. In our implemented fuzzy model structure, type-2 fuzzy sets are in the premise and the rule consequents are crisp numbers. The interval type-2 fuzzy sets of the proposed approach are determined using the overlapping clusters concept. During the system identification process of the fuzzy rule base, membership function parameter values are characterized using different clustering methods. The statistics found at the end of the cluster analysis generate the upper and lower membership functions of the interval type-2 fuzzy model. Additionally, support vector regression is used to learn the parameters of the rule consequents. SVR not only enhanced the learning capability of the proposed model but also decreased the effects of overfitting. For the defuzzification process, the Biglarbegian-Melek-Mendel method, which has a closed-form representation, was used. We used grid search to find the SVR and Biglarbegian-Melek-Mendel method design parameters for the proposed interval type-2 fuzzy system.
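Grid search over the design parameters amounts to scoring every combination on a predefined grid and keeping the best one. The sketch below is generic: the parameter names, the grid values and the toy scoring function are placeholders, since the actual grids and validation protocol are not given in the text. In practice the score would be a validation measure of the trained fuzzy system.

```python
from itertools import product


def grid_search(score, grid):
    """Evaluate score(**params) on every grid point; return the best params and score."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        current = score(**params)
        if current > best_score:
            best_params, best_score = params, current
    return best_params, best_score


def toy_score(C, epsilon, q, p):
    """Stand-in for a validation score; peaks at C=1, epsilon=0.1, q=0.4, p=0.6."""
    return -((C - 1.0) ** 2 + (epsilon - 0.1) ** 2 + (q - 0.4) ** 2 + (p - 0.6) ** 2)


grid = {
    "C": [0.1, 1.0, 10.0],   # SVR regularisation (illustrative values)
    "epsilon": [0.01, 0.1],  # SVR tube width (illustrative values)
    "q": [0.2, 0.4, 0.6],    # weight of the upper average (illustrative values)
    "p": [0.4, 0.6],         # weight of the lower average (illustrative values)
}
best_params, best_score = grid_search(toy_score, grid)
```

The cost grows multiplicatively with the number of values per parameter (36 evaluations here), so coarse grids are usually refined around the best point rather than made dense everywhere.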
Blind validation experiments were carried out to reveal the accuracy of the proposed method. Each peptide in both the training and testing peptide datasets was encoded into physico-chemical and bio-chemical descriptor vectors. Then, the descriptors were normalized using min-max scaling so that every descriptor varied in the range between 0 and 1. When a large number of features is available, feature selection is often required in bioinformatics to remove irrelevant features, avoid overfitting and improve model performance [43].
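Min-max scaling of the descriptors can be sketched as follows. The sketch assumes that the per-descriptor minimum and maximum are estimated on the training set and then reused for the test set, which the text does not state explicitly; function names and data are illustrative.

```python
def fit_min_max(rows):
    """Column-wise minimum and maximum of a list of equally long feature rows."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]


def apply_min_max(rows, mins, maxs):
    """Scale each column with (v - min) / (max - min); constant columns map to 0."""
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, mins, maxs)]
        for row in rows
    ]


train = [[1.0, 10.0], [3.0, 30.0], [2.0, 20.0]]
test = [[2.5, 40.0]]

mins, maxs = fit_min_max(train)
train_scaled = apply_min_max(train, mins, maxs)
test_scaled = apply_min_max(test, mins, maxs)
```

With ranges taken from the training set only, a test value outside the training range falls outside [0, 1] (here the second test feature becomes 1.5); scaling both sets jointly would avoid this but would use information from the test data.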
As the encoded feature set became large (≥ 5000), a feature selection method, multi-cluster feature selection [44], was used in this work. Multi-cluster feature selection is an unsupervised feature selection method that does not require labeled data and has already been used in many bioinformatics applications [45]-[47]. We reduced the dimensionality from many thousands to a few hundred. We found that 161, 247, 172 and 141 descriptors were adequate for modelling Tasks I, II, III and IV, respectively. We also found that amino acid polarity appeared in the selected features of Tasks I, II and III as the most discriminative descriptor.
To be consistent in comparisons with similar prediction methods, the coefficient of determination (q²) [48] and the Spearman rank correlation coefficient (ρ) [49] were used. The percentage improvement of the proposed model as compared to the models found in this research domain (I₁%) and to our previous work (I₂%) was computed as in (11).
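Both evaluation measures are simple to compute. The sketch below assumes the common definition q² = 1 − Σ(y − ŷ)² / Σ(y − ȳ)², with ȳ taken as the mean of the measured values being evaluated, and computes ρ as the Pearson correlation of average ranks; function names and data are illustrative.

```python
def q_squared(measured, predicted):
    """Coefficient of determination under the assumed definition 1 - PRESS / SS."""
    mean_y = sum(measured) / len(measured)
    press = sum((y - p) ** 2 for y, p in zip(measured, predicted))
    total = sum((y - mean_y) ** 2 for y in measured)
    return 1.0 - press / total


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks


def pearson(a, b):
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def spearman_rho(measured, predicted):
    """Spearman rank correlation as the Pearson correlation of the ranks."""
    return pearson(average_ranks(measured), average_ranks(predicted))


measured = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
q2 = q_squared(measured, predicted)
rho = spearman_rho(measured, predicted)
```

The two measures answer different questions: q² penalises the size of each prediction error, whereas ρ only checks whether the peptides are ranked in the right order, so a model can score a perfect ρ while q² stays below 1, as in this toy example.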
Table 3 reports the training and testing prediction performances of the proposed method when hard c-means clustering (HCM), fuzzy c-means clustering (FCM), and hierarchical cluster analysis (HCA) were used to initialize the membership grades of the interval type-2 fuzzy sets. For all tested models, the number of clusters varied between two and four. The best predictive accuracy is achieved with FCM (three tasks) and HCA (one task). The SVR parameters (C and ε) and the Biglarbegian-Melek-Mendel method design parameters (q and p) are given underneath the best models. For all tasks, we trained SVR with a linear kernel to obtain the rule consequent coefficients of the proposed interval type-2 fuzzy system.
The correlation between measured and predicted real-valued binding affinities is shown in Fig. 8. The best models of the proposed method (overlapping clustering and support vector based interval type-2 fuzzy system) achieved higher accuracy and a significant increase in prediction performance over the previously published methods [7], [8], [10] on unseen peptides, as shown in Table 4. As compared to the best predictive methods (0.691, 0.746, 0.232 and 0.586) presented in the literature, the proposed method showed better generalization ability on all tasks, improving prediction accuracy by 4.1%, 1.3%, 58.2% and 12.5% for Tasks I, II, III and IV, respectively. Additionally, as compared to our previous work (support vector based type-1 fuzzy system) [12], the proposed method improved prediction accuracy by 3.3%, 1.8%, 18.4% and 2.5% for Tasks I, II, III and IV, respectively.
Defining the fuzzy sets and the number of rules are the main concerns in the structure identification of a fuzzy system. The formation of rules can be automated with the help of cluster analysis, where each partition maps to a fuzzy rule. In clustering, the number of clusters must be set before the cluster analysis is performed. However, determining the exact number of clusters is a considerable difficulty. We performed a grid search (from two up to seven clusters) to observe the tendency of groupings within the peptide binding affinity datasets. We found that three clusters is most often the natural number of groupings of peptide binding affinities when incorporated with the proposed interval type-2 fuzzy system. The number of clusters we found for the peptide binding affinity datasets also agrees with the fact that the number of membership functions should be ≤ 7 in each input domain for the practical design of an interval type-2 fuzzy logic system [50]. This 'magical number' is based on a study [51] stating that keeping more than 7 ± 2 objects in mind at the same time becomes confusing for a human and exceeds his/her information processing capacity.
The overlapping clusters concept was used to overcome the difficulties of the parameter identification process in an interval type-2 fuzzy system. When required, interval type-2 membership function parameters can be further optimised using a learning algorithm [52]. As the initialisation of membership functions otherwise also depends on the parameter values of the learning algorithm, the proposed initialisation process removes this dependency and lets the learning algorithm focus on its ultimate purpose.
Finally, in this study we used Gaussian membership functions as they are relatively easy to implement and require fewer parameters, and therefore involve fewer assumptions. However, any type of membership function could have been used; other types will be implemented in future work and tested to see whether they offer any improvement over our current method.

IV. CONCLUSION
This paper presents a robust hybrid system that incorporates an overlapping clustering concept and support vector regression for the design of an interval type-2 fuzzy system. This is one of the first studies where a support-vector based interval type-2 fuzzy system is applied to a real bioinformatics problem. The performance and robustness of the proposed hybrid predictive models were demonstrated on one of the most challenging problems in molecular biology: the prediction of peptide binding affinity. The analyses on four different case studies in the prediction of peptide binding affinity have yielded better generalisation ability and higher predictive accuracy than those presented in the literature. This study has both biological and computational implications: the predictive model has yielded a number of useful biological characteristics of the peptides (e.g. amino acid polarity) which could help the analysis of peptides with more appropriate binding affinities. In addition to the development of a robust predictive model with applications in high dimensional datasets (rare in fuzzy system-based studies), the study presents a successful implementation of the overlapping clustering framework in the design of an interval type-2 fuzzy system. As this framework can also help determine the initial values of the interval type-2 fuzzy system, it could be further incorporated with any type of clustering, machine learning and optimisation method to help further improve its outcome. Further research will be carried out in this direction.

FIGURE 2. Stages of the proposed interval type-2 fuzzy system for the prediction of peptide binding affinity.

FIGURE 3. Footprint of uncertainty of an interval type-2 fuzzy set when the standard deviation is fixed and the center is blurred.

FIGURE 4. Footprint of uncertainty of an interval type-2 fuzzy set when the center is fixed and the standard deviation (std) is blurred.

FIGURE 5. Illustration of overlapping clustering concept used to identify the end points of the interval type-2 membership function. FOU: Footprint of Uncertainty. UMF: Upper Membership Function. LMF: Lower Membership Function.

FIGURE 6. Sequence logo plots of Tasks 1-4. Training (left) and test (right) peptide datasets are represented in position specific amino acid frequencies.

FIGURE 7. Encoding of a peptide sequence as amino acid descriptors.A) octa-peptide amino acid composition B) nona-peptide amino acid composition.

FIGURE 8. The correlation between measured and predicted peptide binding affinities; the training set is the former and the testing set is the latter.

TABLE 1. Characteristics of the peptide binding affinity tasks.

TABLE 2. Statistics of the peptide binding affinity tasks.

TABLE 3. The prediction scores of the proposed method based on different clustering methods.

TABLE 4. Comparison of the results of the proposed method to those reported in this research domain.