Identification of Type VI Effector Proteins Using a Novel Ensemble Classifier

The type VI secretion system (T6SS) delivers effector proteins (type VI secretion system effectors, termed T6SEs) into neighboring target cells. Many human pathogens express T6SEs, including Vibrio cholerae, Burkholderia spp., and Pseudomonas aeruginosa. T6SEs play vital roles in the competitive survival and pathogenesis of bacterial populations. Several machine-learning methods can distinguish T6SEs from non-T6SEs, but we believe there is room for further improvement. Therefore, we herein propose a more powerful ensemble predictor for identifying T6SEs. Initially, we construct a benchmark dataset from existing studies and databases. Then we use $k$-separated-bigrams-PSSM (a type of feature encoding) to convert the protein sequences to numerical vectors. The synthetic minority oversampling technique (SMOTE) is next employed to solve the training data imbalance problem. Finally, we employ a soft voting strategy to construct an integrated model combining six fine-tuned base classifiers. The proposed model performs excellently in terms of accuracy (ACC, 99.0%), Matthews correlation coefficient (MCC, 97.8%), sensitivity (SN, 97.1%), and specificity (SP, 100%) in independent testing.


I. INTRODUCTION
Type VI secretion is a recently described mechanism for protein transport across the cell envelope. The type VI secretion system (T6SS) is widely distributed in Gram-negative bacteria and appears to inject toxins into competitor bacteria and/or eukaryotic cells [1]. T6SSs are phage tail spike-like injectisomes. The protein complex releases virulence factors into the extracellular medium and/or transports these factors directly into target cells [2], [3]. The T6SS delivers multiple, diverse effector proteins directly into target cells using a dynamic ''firing'' mechanism related to the action of contractile bacteriophage tails [4]. The effector proteins delivered by the T6SS are named type VI secretion effector proteins (T6SEs), and play important roles in the competitive survival of bacterial populations, as well as in the pathogenesis of bacteria [5]. Based on known effector transport mechanisms, effectors can be classified as ''specialized'' or ''cargo'' effectors. Specialized effector proteins are fused to the C-terminus of the T6SS structure, and include VgrG, Hcp, and/or PAAR (Pro-Ala-Ala-Arg) domain-containing proteins, the last of which are known to sharpen the VgrG spike. Cargo effectors either interact directly with, or require a specific chaperone or adaptor protein to load into, the lumen of the Hcp tube or the VgrG spike [6]. T6SEs can function inside target cells, as well as extracellularly. Many human pathogens, including Vibrio cholerae, Burkholderia spp., and Pseudomonas aeruginosa, express the T6SS. The T6SS has also been discovered in plant pathogens, including Agrobacterium tumefaciens, Pectobacterium atrosepticum, and Xanthomonas oryzae. The T6SS locus in Yersinia pestis (NCBI Gene accession YPO0499-YPO0516) plays a crucial role in phagocytosis-promoting activity [7].
Although recent years have witnessed advances in the characterization of the role and function of these secretion systems, the number of identified T6SS effectors is quite limited, with almost all identified effectors being encoded with genetic linkage to the VgrG or Hcp locus in the T6SS main gene cluster or orphan VgrG/Hcp islands.
T6SEs can be experimentally identified by both discovery-driven and knowledge/hypothesis-based methodologies.
These include bioinformatics analyses, the genetic analysis of T6SS-associated genes, proteomics-based methods, and mutant library screening [6]. Additionally, various online bioinformatics tools and databases are available. For example, SecretEPDB [8] (Bacterial Secreted Effector Protein DataBase) is a knowledgebase for effector proteins of the type III, type IV, and type VI secretion systems. SecretEPDB provides detailed experimental annotations of these effector proteins by manually extracting and integrating structural and functional information from currently available databases and literature in terms of protein characteristics, protein function, protein secondary structure, Pfam domains, metabolic pathways, and evolutionary details. SecReT6 is an integrated database providing comprehensive information on just the type VI secretion system in bacteria, including functional and genome-centric descriptions of T6SSs, T6SS components, cargo effectors, and immunity proteins according to sequence-derived data and findings reported in the literature [9].
Various attempts have been made toward in silico prediction of T3SEs, T4SEs, and T6SEs. Several computational models have previously been constructed for predicting T3SEs [10]-[16], as well as for T4SEs [17], [18]. Bastion6 [5] was the first tool for T6SS effector protein prediction; it is a two-layer support vector machine (SVM)-based ensemble model that obtains 94.3% accuracy. Apart from Bastion6, Zalguizuri et al. and An et al. attempted to predict T6SEs, but obtained unsatisfactory results [19], [20]. Recently, PyPredT6, a Python-based tool, was proposed to predict T6SS effectors using 837 features [21]. These 837 features were selected from a position specific peptide sequence profile (PSPSP), a position specific nucleotide sequence profile (PSNSP), a position specific secondary structure profile (PSSSP), a position specific solvent accessibility profile (PSSAP), and conjoint triad descriptors (CTD) by a randomized decision tree. PyPredT6 obtains 89.15% accuracy using the consensus of multi-layer perceptron (MLP), SVM, k-nearest neighbor (KNN), naïve Bayes (NB), and random forest (RF) classifiers, in which the final class is the one predicted by the majority of the classifiers.
We believe considerable room remains for improving T6SE prediction, given the moderate performance of Bastion6 and PyPredT6. Therefore, we developed and herein present a more powerful machine learning method to identify T6SS effector proteins. The model is trained on a set of 138 experimentally validated T6SS effectors and 1,112 non-effectors, and tested on 107 T6SEs and 200 non-T6SEs. We demonstrate that our model can discriminate T6SS effectors from non-effectors effectively, with an accuracy (ACC) of 99.00%, a sensitivity (SN) of 97.10%, and a specificity (SP) of 100% in independent testing.
Our method can be broken down into five main steps, as seen in Figure 1: (1) Collect protein sequences to construct a benchmark dataset for training and testing. (2) Convert the sequences to numeric vectors and find the most effective features. (3) Use SMOTE to balance the datasets. (4) Construct a powerful predictor that combines six base classifiers for final classification, and tune the parameters to improve the model's performance. (5) Evaluate the accuracy of the predictor via cross-validation and discuss the results.

II. MATERIALS AND METHODS

A. DATA COLLECTION
The datasets consist of a training set for the creation of the model and an independent test set for testing the performance of the model. We used the same training set as Bastion6, which was originally collected from SecretEPDB. It contains 138 positive sample sequences and 1,112 negative, both of which have had homologous sequences removed at a threshold of 90% sequence identity by CD-HIT [22], [23]. To evaluate the performance of our model, we constructed an independent testing set containing 107 positive T6SEs (87 of which were collected from SecReT6 after removing homologous sequences at an identity threshold of 90%, and 20 from Bastion6) and 200 negative samples (all from Bastion6).

B. FEATURE EXTRACTION
Generally, the primary step in amino acid sequence-based bioinformatics problems is conversion of the protein sequence into a set of numerical features that can reveal intrinsic characteristics of the protein. Different feature descriptors reflect the diverse properties of proteins [24]-[28]. These properties roughly divide into the following categories: amino acid composition (AAC) [29]-[33], pseudo amino acid composition (PseAAC) [34]-[36], physicochemical properties, and evolutionary and gene ontology information [37]-[39]. The choice of features directly influences the model's efficiency and precision; therefore, selecting the most useful features is key to protein sequence classification problems. Furthermore, by examining the characteristics of these features, we may be able to explain the differences between T6SS effector and non-effector proteins and inspire further research.
In our work, evolutionary information, as represented by the position specific scoring matrix (PSSM), is extremely important, especially as characterized by the k-separated-bigrams-PSSM, which is introduced in detail below.

1) The POSITION SPECIFIC SCORING MATRIX (PSSM)
Before introducing the k-separated-bigrams-PSSM, it is necessary to understand the PSSM. A PSSM describes protein sequence evolutionary information, and evolutionary conservation can reflect important biological functions. A PSSM can be generated using PSI-BLAST [40] by searching protein sequence databases for sequences homologous to a query protein. PSSMs have been widely used in bioinformatics [41]-[44]. For our analyses, we used three PSI-BLAST iterations against Swiss-Prot, with an E-value cutoff of 0.01 [45].
The PSSM of sequence P is represented by equation (1):

$$P_{\mathrm{PSSM}} = \begin{bmatrix} E_{1\to 1} & E_{1\to 2} & \cdots & E_{1\to 20} \\ E_{2\to 1} & E_{2\to 2} & \cdots & E_{2\to 20} \\ \vdots & \vdots & \ddots & \vdots \\ E_{L\to 1} & E_{L\to 2} & \cdots & E_{L\to 20} \end{bmatrix} \tag{1}$$

where $E_{i\to j}$ represents the score of the amino acid residue at the $i$-th position of the protein sequence being changed to amino acid residue type $j$ during the evolutionary process, $L$ is the length of the query sequence, and the numerical codes $1, 2, \ldots, 20$ represent the 20 native amino acid residues in alphabetical order. Because the raw scores vary over a wide range, the following standardization is performed:

$$E_{i\to j} = \frac{E^{0}_{i\to j} - \mu_i}{\sigma_i}, \qquad \mu_i = \frac{1}{20}\sum_{u=1}^{20} E^{0}_{i\to u}, \qquad \sigma_i = \sqrt{\frac{1}{20}\sum_{u=1}^{20}\left(E^{0}_{i\to u} - \mu_i\right)^2} \tag{2}$$

where $i = 1, 2, \ldots, L$, $j = 1, 2, \ldots, 20$, and $E^{0}_{i\to j}$ represents the original scores calculated by PSI-BLAST.
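As a concrete illustration, the standardization of equation (2) can be sketched in a few lines of NumPy (a minimal sketch assuming per-position z-score normalization of the raw PSI-BLAST scores; the function name is ours):

```python
import numpy as np

def standardize_pssm(raw_pssm):
    """Z-score standardize a raw L x 20 PSSM row by row, as in equation (2)."""
    raw_pssm = np.asarray(raw_pssm, dtype=float)
    mu = raw_pssm.mean(axis=1, keepdims=True)    # mean of the 20 scores at each position
    sigma = raw_pssm.std(axis=1, keepdims=True)  # standard deviation at each position
    return (raw_pssm - mu) / sigma

# toy PSSM for a 3-residue query sequence (L = 3)
raw = np.arange(60, dtype=float).reshape(3, 20)
standardized = standardize_pssm(raw)
# each row of the result now has mean 0 and standard deviation 1
```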

2) THE k-SEPARATED-BIGRAMS-PSSM
The k-separated-bigrams-PSSM was proposed to model the relationships between two separated amino acids in a protein sequence [46]. The amino acid bigram probabilities are extracted from the sequential evolutionary probabilities in a PSSM. The algorithm can be summarized mathematically as:

$$T_{m,n}(k) = \sum_{i=1}^{L-k} N_{i,m}\, N_{i+k,n} \tag{3}$$

where $N$ is the PSSM representation of a protein sequence ($1 \le m \le 20$, $1 \le n \le 20$, and $k \le K$), and $K$ is usually set to 11. The matrix $N$ has $L$ rows and 20 columns, as in equation (1), where $L$ is the length of the protein sequence. The parameter $k$ represents the distance between the amino acids: when $k = 1$, the amino acids used to calculate the transition probabilities are separated by 0 amino acids; when $k = 2$, they are separated by 1 amino acid; and so forth. For each $k$, $T(k)$ generates 400 features. The k-separated-bigrams-PSSM is visualized in Figure 2 below. In this paper, we extract k-separated-bigrams-PSSM features using POSSUM [47], a web server offering users a comprehensive and flexible generator for various kinds of PSSM-based descriptors.
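The $T(k)$ feature computation described above can be sketched with NumPy as follows (our own minimal implementation, not POSSUM's code; the function name is ours):

```python
import numpy as np

def k_separated_bigrams(N, K=11):
    """k-separated-bigrams-PSSM features.

    N: (L, 20) PSSM matrix; returns a (K * 400,) vector, i.e. one
    flattened 20 x 20 matrix T(k) for each k = 1, ..., K, where
    T_{m,n}(k) = sum_i N[i, m] * N[i + k, n].
    """
    N = np.asarray(N, dtype=float)
    L = N.shape[0]
    feats = []
    for k in range(1, K + 1):
        # outer products between rows i and i + k, summed over all valid i
        T = N[:L - k].T @ N[k:]        # shape (20, 20)
        feats.append(T.ravel())
    return np.concatenate(feats)

# a protein of length 50 yields 11 * 400 = 4400 features
vec = k_separated_bigrams(np.random.rand(50, 20))
```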

3) SYNTHETIC MINORITY OVERSAMPLING TECHNIQUE (SMOTE)
In the area of computational biology, datasets often have more negative samples than positive samples [48]-[51]. Imbalanced datasets can lead to overfitting, and sample prediction can be biased towards the category with more samples. Random oversampling simply copies existing minority samples to increase their number. In contrast, the main idea behind SMOTE is to create new minority class samples by interpolating between minority class examples that lie close together [52]. Specifically, for every minority sample $x$, its $k$ nearest neighbors from the minority class are first selected. Then, according to the oversampling rate, $N$ neighbors are chosen randomly from the $k$ nearest neighbors. Finally, synthetic examples $x_{\mathrm{new}}$ are created according to the following formula:

$$x_{\mathrm{new}} = x + \mathrm{random}(0,1) \times (x_j - x) \tag{4}$$

where $x_j$ ($j = 1, 2, \ldots, N$) is one of the randomly selected nearest neighbors of $x$, and $\mathrm{random}(0,1)$ generates a random number between 0 and 1. In this way, SMOTE creates a new synthetic sample along the line between the minority class sample and the selected nearest neighbor. We obtained 1,112 T6SE and 1,112 non-T6SE samples for the training set using this procedure.
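The interpolation step can be sketched as follows (a simplified SMOTE for illustration, not the exact implementation used in this work; function and parameter names are ours):

```python
import numpy as np

def smote_oversample(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by interpolating each
    randomly chosen seed with one of its k nearest minority neighbors."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    synthetic = []
    for _ in range(n_new):
        x = X_min[rng.integers(len(X_min))]
        # k nearest minority-class neighbors (index 0 is x itself, so skip it)
        dist = np.linalg.norm(X_min - x, axis=1)
        neighbors = np.argsort(dist)[1:k + 1]
        x_j = X_min[rng.choice(neighbors)]
        # x_new = x + random(0, 1) * (x_j - x)
        synthetic.append(x + rng.random() * (x_j - x))
    return np.array(synthetic)

X_min = np.random.default_rng(1).random((10, 4))  # 10 minority samples
X_new = smote_oversample(X_min, n_new=15)
```

Because each synthetic point lies on the line segment between two minority samples, the new samples stay within the region occupied by the minority class.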

4) MODEL CONSTRUCTION
The goal of ensemble methods is to combine the predictions of several estimators built with given learning algorithms to improve generalizability over any single estimator. Two families of ensemble methods exist: averaging methods and boosting methods. The voting classifier is one type of averaging method; it is a wrapper containing different machine learning classifiers that classifies data by combined voting. The idea behind the voting classifier is to conceptually combine different machine learning classifiers and use a majority vote or the average predicted probabilities to predict class labels. This type of ensemble classifier is useful for a set of equally well-performing models, as it balances out their individual weaknesses [53]-[58]. ''Hard'' and ''soft'' voting methods exist for making decisions regarding the target class. In hard voting, the predicted class label for a particular sample is the majority of the class labels predicted by the individual classifiers. In contrast, soft voting returns the class label as the argmax of the sum of predicted probabilities. The weighted average probabilities for a sample are calculated by multiplying the weight assigned to each classifier by that classifier's predicted class probabilities; the final class label is the class with the highest average probability.
In our method, we use a soft voting strategy [59] that ensembles LogisticRegression, SVC [60], AdaBoostClassifier [61], LinearDiscriminantAnalysis [62], GradientBoostingClassifier, and LGBMClassifier [63]. The first five base classifiers are available in the scikit-learn package [62], [64], an open source machine learning library providing various built-in machine learning algorithms and models, while LGBMClassifier is provided by the LightGBM library [63]. We trained every base classifier on the same training set, used a grid search to find optimal parameters, and then ensembled the results using the VotingClassifier [62].
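A minimal sketch of such a soft-voting ensemble with scikit-learn is shown below. LGBMClassifier is omitted here because it comes from the separate LightGBM package, and the grid-searched hyperparameters of our model are not reproduced; the synthetic data merely stands in for the real feature vectors:

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# soft voting averages the predicted class probabilities of the base models
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("svc", SVC(probability=True)),  # probability=True is required for soft voting
        ("ada", AdaBoostClassifier()),
        ("lda", LinearDiscriminantAnalysis()),
        ("gb", GradientBoostingClassifier()),
    ],
    voting="soft",
)

X, y = make_classification(n_samples=200, n_features=20, random_state=0)
ensemble.fit(X, y)
probabilities = ensemble.predict_proba(X)  # averaged class probabilities
```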

C. PERFORMANCE AND EVALUATION
Choosing appropriate performance metrics is a necessary step in measuring how well a model works and in comparing new models with existing methods. We use ACC, SN, SP, the Matthews correlation coefficient (MCC) [65]-[71], the area under the receiver operating characteristic curve (ROC AUC) [72]-[82], and the F-score, all calculated from confusion matrices built from the true and predicted classes. The chosen metrics are defined as follows:

$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\mathrm{SN} = \frac{TP}{TP + FN}, \qquad \mathrm{SP} = \frac{TN}{TN + FP}$$

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

$$F_1 = \frac{2\,TP}{2\,TP + FP + FN}$$

Both the training set and the testing set are categorized into two classes, positive and negative. TP is the number of positive samples classified as positive, TN the number of negative samples classified as negative, FP the number of negative samples classified as positive, and FN the number of positive samples classified as negative. SN measures the true positive rate, while SP measures the true negative rate; both are equally important for evaluating the model. ACC reflects the predictor's overall accuracy, but when the dataset is imbalanced, ACC may be misleading [83]. MCC, however, remains informative for measuring a model's overall quality. MCC ranges from −1 to 1, where −1 indicates that the predictor is always wrong, 0 indicates random guessing, and 1 indicates that the predictor classifies all samples correctly; MCC can thus be seen as a correlation coefficient between the true and predicted classes. The F1 score, also known as the balanced F-score, can be interpreted as a weighted average of precision and recall; it ranges from 0 to 1, with precision and recall contributing equally. The ROC curve plots the true positive rate against the false positive rate; the greater the area under the curve (AUC), the more accurate the prediction. The maximum AUC value is 1.0, denoting a perfect prediction, while a random guess gives an AUC of 0.5. These metrics are commonly adopted to evaluate prediction quality [84]-[86].

III. RESULTS AND DISCUSSION
Three validation strategies have been widely applied in evaluating the performance of bioinformatics predictors [87]-[102]: jackknife cross-validation, 10-fold cross-validation, and independent testing. In this paper, we use two of them, namely 10-fold cross-validation and independent testing, to evaluate our model. In k-fold cross-validation, the original samples are randomly partitioned into k equal-size subsamples. A single subsample is retained as the validation data for testing the model, and the remaining k−1 subsamples are used as training data. This process is repeated k times, with each of the k subsamples used exactly once as the validation data, and the k results are averaged to produce a single estimate. The advantage of this method is that all observations are used for both training and validation, and each observation is used for validation exactly once. For classification problems, stratified k-fold cross-validation is typically used, in which each fold contains roughly the same proportion of class labels. In practice, k is usually set to 10: the dataset is divided into 10 parts, nine of which are used for training and one for testing during each iteration, and the average accuracy of the 10 results is taken as an estimate of the algorithm's overall accuracy. Usually, 10-fold cross-validation is performed multiple times.
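The stratified 10-fold procedure described above can be sketched with scikit-learn (synthetic data and a single base learner stand in here for the real dataset and ensemble):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# each of the 10 folds preserves the overall class proportions
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
mean_accuracy = scores.mean()   # single estimate averaged over the 10 folds
```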
Independent testing, also known as holdout testing, has no overlap between training and test sets; the training set is completely separate from the testing set. When the testing set is split from the training set, its distribution should be similar to that of the training set; otherwise, the results of this testing strategy may be misleading [103]. In this paper, we used two datasets collected separately: one for training, and the other for testing.

A. COMPARISON OF SINGLE MODELS AND THE ENSEMBLE MODEL
Figures 3 and 4 show histograms obtained with the different single models and the ensemble model. In this experiment, both the single models and the ensemble model were trained on the balanced training set produced by SMOTE, with k-separated-bigrams-PSSM as the feature encoding. SVC is the most effective single predictor under both 10-fold cross-validation and independent testing. The ensemble predictor, which integrates all of the single models, is more accurate and more stable than its base classifiers under both evaluation strategies.

B. EVALUATION ACROSS FEATURE ENCODING METHODS
For each feature encoding, the ensemble predictor was trained on the original training set and validated by independent testing. Figure 5 shows the ROC AUC obtained by the different feature encodings. As shown, k-separated-bigrams-PSSM achieves the best AUC performance.

C. SMOTE EFFECTS
We used a comparison experiment to investigate the effectiveness of SMOTE. To ensure objectivity and robustness, we conducted the same experiments on the different feature encodings. As evaluated on the independent test set, the predictors trained both without and with SMOTE achieved promising results, with the latter performing better. SMOTE improved the performance of most feature encodings, with the exception of Directional Property-PSSM (DP_PSSM) [104]. The comparison is shown in Figure 6. Overall, we conclude that SMOTE is helpful for the classification of T6SEs.

D. EVALUATION WITH 10-FOLD-CROSS-VALIDATION AND INDEPENDENT TESTING
Based on the above experiments, we used k-separated-bigrams-PSSM for feature encoding and SMOTE to address the dataset imbalance problem. An ensemble classifier integrating LogisticRegression, SVC, AdaBoostClassifier, LinearDiscriminantAnalysis, GradientBoostingClassifier, and LGBMClassifier was employed with a soft voting strategy to predict T6SEs. The results shown in Figure 7 indicate that our model is both effective and robust.

E. COMPARISON WITH STATE-OF-THE-ART METHODS
To further validate the performance of our proposed method, we compared our model's results with those of existing methods, listed in Table 1 below. To date, only two other predictors can identify T6SS effector proteins: Bastion6 and PyPredT6. As Table 1 shows, our method achieves the highest scores in terms of accuracy and specificity. Compared with Bastion6, our method is somewhat inferior in terms of sensitivity, but it performs better overall, because its specificity is far higher in independent testing. Furthermore, our model exceeds PyPredT6 in ACC by 9.9%, SN by 5.9%, and SP by 9.3% on the independent dataset.

V. CONCLUSION
We collected a new independent testing set and developed an ensemble predictor integrating six base classifiers for recognizing T6SEs. Several experiments were performed to validate the effectiveness and robustness of our method. First, we selected the most effective feature encoding (k-separated-bigrams-PSSM) from among various feature encodings; the results of models trained with different encodings show that PSSM-based features are more helpful than other features. We then demonstrated that SMOTE is beneficial for most feature encodings. Next, we compared the performance of different single classifiers, finding SVC to be the most effective single model. Ultimately, our ensemble classifier proved to be the most effective and robust T6SE predictor available, in both 10-fold cross-validation and independent testing, and it is far ahead of existing methods in terms of ACC and SP. Taking all of our results into account, we conclude that our method is the best available predictor for screening experimental T6SE targets. In the future, we will pay more attention to deep learning prediction techniques [105]-[108].

VI. AUTHOR CONTRIBUTIONS
C.W and M.G initiated the idea, conceived the whole process, and drafted the manuscript. J.L implemented the experiments and designed the figures. Y.Z helped with data analysis and revised the manuscript. M.G finalized the paper. All authors have read and approved the final manuscript.

ACKNOWLEDGMENT
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. The authors would like to thank S. M. Thompson, from Liwen Bianji (www.liwenbianji.cn/ac), Edanz Editing China, for editing the English text of a draft of this manuscript.