Prediction of Therapeutic Peptides Using Machine Learning: Computational Models, Datasets, and Feature Encodings

Peptides, short chains of amino acids, have shown great potential in the investigation and development of novel medications for treatment and therapy. The wet-lab-based discovery of potential therapeutic peptides, and eventual drug development, is a hard and time-consuming process. Computational prediction using machine learning (ML) methods can expedite and facilitate the discovery of candidates with therapeutic effects. ML approaches have been practiced favorably and extensively in the areas of proteins, DNA, and RNA to discover hidden features and functional activities, and have recently been utilized for the functional discovery of peptides for various therapeutics. In this paper, a systematic literature review (SLR) is presented to identify the data sources, ML classifiers, and encoding schemes utilized in state-of-the-art computational models to predict therapeutic peptides. To conduct the SLR, forty-one research articles were selected carefully based on well-defined selection criteria. To the best of our knowledge, no SLR is available that provides a comprehensive review of this domain. In this article, we propose a taxonomy based on the identified feature encodings, which may offer relational understanding to researchers. Similarly, a framework model for the computational prediction of therapeutic peptides is introduced to characterize the best practices and stages involved in the development of peptide prediction models. Lastly, common issues and challenges are discussed to provide researchers with encouraging future directions in the field of computational prediction of therapeutic peptides.


I. INTRODUCTION
Peptides are usually short chains of amino acids (<50 residues in length) that exert a critical impact on human physiological activities as neurotransmitters, anti-infectives, ion channel ligands, growth factors, and hormones [1]. Low toxicity and high specificity under normal conditions make these short-chain peptides worthwhile and beneficial in treatment [2]. Over seven thousand peptides with distinct characteristics have been discovered to date, for instance, cell-penetrating peptides (CPP), anticancer peptides (ACP), antiviral peptides (AVP), anti-inflammatory peptides (AIP), and antimicrobial peptides (AMP) [3]. These important characteristics make peptides vital candidates for proposing new drugs for therapy. As a precedent, ACPs are being utilized for the treatment of cancer [4]-[7], and CPPs have been observed to support intracellular communication and are considered a useful medium for drug delivery [8], [9].
Inflammation occurs in reaction to any type of physical damage or injury, due to the non-specific response of the immune system [10]-[12]. AIPs have recently been used in the remedy of several inflammatory disorders such as Alzheimer's disease [13]. Similarly, natural and synthetic peptides have the potential to constrain the expression of inflammatory promoters [14].
In several human pathologies, angiogenesis invokes a pathological process that creates dense tumors in the human body [15]. The discovery of antiangiogenic peptides (AAP) is an encouraging remedial channel for the treatment of tumors and cancerous cells by obstructing the angiogenic process. Therefore, the accurate identification of AAPs is mandatory to recognize their physicochemical characteristics. Cancer is another globally preeminent cause of mortality, and significant attention has been devoted to the utilization of peptide-based therapies [16]. Cancer treatment is among the most important applications in which peptide-based therapeutic methods have grown quickly [6], [7], [17], [18]. The new peptide-based cancer treatment methods developed in recent years extend from naturally available and synthetic peptides to peptide-conjugate drugs and small-molecule peptides [7], [19]. AMPs have a broad spectrum of actions and capabilities against viral, bacterial, and fungal infections, are involved in mitogenic and signaling processes [20], and have also shown anticancer activity. Sipuleucel-T, with the trade name PROVENGE [21], [22], and Leuprolide, with the trade name LUPRON [1], [23], [24], are two examples from the list of FDA-approved peptide-based cancer drugs for the treatment of prostate cancer. Furthermore, the HPRP-A1 peptide has shown excellent antimicrobial and anticancer activity, according to the experimental studies by Cuihua Hu et al. [6] and Wenjing Hao et al. [18]. Moreover, S. S. Usmani et al. developed a repository listing FDA-approved peptide- and protein-based drugs [5].
The production of peptide-based biologic medications is less complex and less costly than the traditional drug development process [1]. Thus, the identification of peptides with remedial characteristics is very important for inventing new and effective therapeutic drugs, which accelerates their use in clinical treatment [25]. Consequently, there is immense potential for the discovery of generic, novel, and experimentally expandable solutions for the precise prediction of therapeutic peptides.
The prime focus of this manuscript is to identify and discuss the diverse ML-based peptide prediction models. Numerous studies have been reported in this field, which is gaining importance due to the effectiveness of peptide-based therapeutics. Thus, a comprehensive investigation is essential to recognize and summarize the current research developments in the field. However, to the best of our knowledge, no such review has been reported. This SLR reflects various aspects of the investigation in this area, including the data sources used for dataset collection, the encoding schemes used to convert categorical data into an ML-understandable form, current challenges, and future prospects. Moreover, a feature encoding taxonomy and a framework model have also been proposed.
The remainder of this paper is organized as follows: Section II describes the research methodology adopted in this review, including the research objectives, the research questions and their motivations, the search strategy, the selection procedure used to obtain relevant articles, the abstract-based keywording used to classify the articles, and the quality assessment criteria. Analysis and results are presented in Section III to deliberate the extracted outcomes and to answer the research questions systematically. In Section IV, the research findings are summarized through comprehensive discussion, the proposed taxonomy, and a generic framework for a therapeutic peptide prediction model. Open issues and challenges are discussed in Section V to offer future perspectives to the research community. Lastly, the review is concluded in Section VI.

II. RESEARCH METHODOLOGY
An SLR provides a complete procedure to assist in the collection, investigation, and determination of the main articles out of all available studies in the field under consideration. The guidelines for SLRs proposed in 2004 by [26] have been followed in this study for unbiased data collection and demonstration of the analyzed and extracted outcomes. Figure 1 depicts the research process adopted for this SLR. It is an appropriate and reflective review procedure consisting of six stages: 1) definition of research objectives, 2) definition of research questions, 3) formulation of the search strategy, 4) screening and selection of papers, 5) classification of papers using keywords, and 6) information extraction and synthesis.

A. RESEARCH OBJECTIVES (RO)
The prime intents of this research are as follows:
RO1: The primary focus is to recognize the state-of-the-art investigations in the area of computational prediction of therapeutic peptides.
RO2: Descriptive representation of the prevailing data sources, feature encodings, and ML classifiers used in the area of interest.
RO3: Hierarchical categorization of different feature encodings through a proposed taxonomy, which underscores the successfully adopted schemes.
RO4: A generic framework proposal to provide a guideline to researchers for future developments in the area.
RO5: Identification of the main issues and unsolved challenges to recognize future research prospects.

B. RESEARCH QUESTIONS (RQ)
To carry out this SLR effectively, the major research questions were defined first. Further, a comprehensive search plan, required in the review for the identification and extraction of the most significant articles, was established. The research questions addressed in this review, along with their major motivations, are mentioned in Table 1. The questions are addressed and answered in light of the well-defined procedure adopted in [26], [27].

C. SEARCH SCHEME
The crucial phase of an SLR is the preparation of a search plan to adequately discover and collect potentially significant articles in the chosen field. This process entails the description of a search string, the literature resources used to apply the search, and the segregation (inclusion/exclusion) plan to obtain the most relevant articles out of the collection. Numerous aspects of the collected articles were assessed qualitatively and empirically to represent the various perspectives associated with the research.

1) SEARCH STRING
An effective and fair investigation has been conducted by formulating a keyword-based string to search and gather available studies in the field using various well-known digital research repositories. To ensure that the search string returns relevant results, the chief concepts were analyzed in light of the research questions to obtain the relevant keywords and terms used in the selected field of study. The finalized keywords and their alternative terms (synonyms) required to formalize a search string for the identification of the most relevant articles are specified in Table 2. In Table 2, the '+' sign expresses the inclusion and the '-' sign the exclusion of studies containing such terms. The logical ''AND'' and ''OR'' operators were used to combine the finalized keywords and alternative terms to form a search string. The '' * '' wildcard character was also utilized to indicate zero or more characters where required. The ''OR'' operator allows extra options to search for, while ''AND'' concatenates the terms to narrow the search and confine the query to relevant results.
The finalized search string has three fragments. The first fragment confines the results to terms related to computer-based or computational methods, the next fragment relates to the prediction of peptides, and the last excludes studies based on non-computational methods (i.e., lab-based experiments). Equation (1) represents the mathematical formulation of the search string.
Here in (1), R stands for the search results obtained against the search string, '∀' represents 'for all', '∨' is used for the 'OR' operator, and '∧' for the 'AND' operator, combined with the search terms expressed in Table 2 to formalize the complete search string according to each selected repository. The generic search term using (1) can be expressed as: ((computer based OR computational OR machine learning) AND (''peptide prediction'' OR ''prediction of peptide'' OR ''therapeutic peptides'') NOT (spectrometry OR binding))

2) LITERATURE RESOURCES
Field specific and most prominent journals have been selected to conduct the literature search from online repositories, dedicated for research publication and collection. The details of selected repositories, applied search strings and the results are mentioned in Table 3.

3) INCLUSION AND EXCLUSION CRITERIA
Parameters defined for the inclusion criteria (IC) are:
IC 1) Include studies that were primarily conducted for the computational prediction of peptides.
IC 2) Include studies targeting the therapeutic effects of peptides.
The exclusion criteria were applied to all articles to exclude studies that involved lab-based experimental identification/prediction of peptides for their therapeutic effects or other irrelevant procedures:
EC 1) If the study did not involve any computational prediction and relied only on in-vivo/wet-lab experiment-based identification of peptides.
EC 2) If the peptides studied did not have any therapeutic activity.
EC 3) If the study was conducted for the structural representation of peptides or peptide-protein mapping.
EC 4) If the study involved binding mechanisms.

D. SELECTION OF RELEVANT PAPERS
To ensure relevance, January 2010 to December 2019 was selected as the publication period to search within. The primary search process produced numerous research articles; not all of them were specifically appropriate according to the defined research questions, and the results also contained duplicates. Accordingly, the retrieved papers required re-assessment and screening to acquire the truly important papers. For the determination of article relevancy and screening, we utilized the process outlined in [28]. The first step was to filter the studies based on their titles and to remove duplicates. A number of retrieved papers were irrelevant to the selected domain, so as a beginning measure they were filtered on the basis of title and the irrelevant papers were barred. Continuing the selection, abstracts were carefully reviewed, and articles describing computational efforts in the selected area were included. Furthermore, after examining the articles against the inclusion criteria, the exclusion criteria were also used to refine out articles that deliberate on the field of therapeutic peptide prediction but did not submit any substantial study contribution or involved clinical experiments. Finally, the articles selected after the described process were included in the succeeding assessment level.

E. ABSTRACT BASED KEYWORDING
Further, the screening and classification of articles were carried out using the two-staged abstract-based keywording procedure presented in [22] to obtain relevant articles. The abstract was analyzed initially to perceive the primary idea of each article, its contribution to the domain, and the most pertinent keywords. The keywords identified from the various articles were then combined, through which a thorough comprehension was developed of the contribution of the research in the domain. Ultimately, these keywords were used to classify the articles for mapping.
Based on the significance of their therapeutics, the selected articles have been separated into four major categories: AIP, AAP, ACP, and AMP. These areas were chosen due to their direct importance in peptide-based therapeutics. Other domains, such as ''cell-penetrating peptides (CPP)'' and ''quorum-sensing peptides (QSP)'', are out of the scope of this study. Furthermore, for clarity, AMPs have several recognized therapeutic activities, such as anti-viral, anti-bacterial, and anti-fungal [29], [30], which are considered and grouped here as AMP.

F. QUALITY ASSESSMENT CRITERIA
A systematic quality assessment of the selected articles is an important task in a systematic review. Since these studies differ in design, the assessment of quality was carried out by means of the sequential evaluation process suggested in [31]. The process contains standards for quality assessment considering all significant factors involved in research [32], comprising the notion, the plan of the study, the way data were collected, the analysis of data, the discussion, and the outcomes. Based on the above points, and to extend the quality of the research, a questionnaire was created jointly with the other authors (see Table 4). Internal and external quality criteria were also used to improve the assessment quality, as adopted by [33]. The internal criteria belong to the internal quality assessment of an article, while the external quality was determined by considering the stability and reliability of the publication source of an article. The ''Journal Citation Reports (JCR)'' and the ''Computer Science Conference Rankings (CORE)'' were used to rate and measure the external quality [34]. The total score is the sum of the individual scores of each criterion. The final score varies between a minimum of 0 and a maximum of 10 and is categorized as high-ranked if the score is greater than 8, average-ranked if the score is between 6 and 8, and low-ranked if the score is less than 6.

III. DATA ANALYSIS
This segment compiles the results and presents a clear evaluation of all selected articles. The selected articles were investigated to effectively answer the research questions. The first part of this section discusses the search results obtained through the defined search string. Next to it is the description of the assessment score and the final part is dedicated to the comprehensive discussions to answer the research questions.

A. SEARCH RESULTS
Modern computational prediction of therapeutic peptides involves various aspects, such as the use of data sources for the collection of a benchmark dataset, feature extraction, and the computational model. The primary search process yielded a total of 2,269 articles from various online data sources. The selection process described in the previous section was applied to this collection. The phases involved in the selection process are described below in Figure 2, and the phase-wise selection outcomes are expressed in Table 5.
The title-based selection was performed by two authors in phase I (P-I), resulting in the selection of 216 articles. Next, duplicate articles were removed in phase II (P-II), and domain-irrelevant articles were also screened on the basis of the inclusion and exclusion criteria defined in the previous section. For instance, the search procedure also produced papers that were involved in the wet-lab experimental identification of peptides or related to their binding capabilities, but not in the computational prediction of therapeutic peptides; these were excluded due to their complete irrelevance. The consensus was measured at 89% using the Kappa factor [35], indicating high agreement among the authors in the selection. Abstract-based screening was applied in phase III (P-III) to the 137 resulting articles obtained from the previous phase; finally, in phase IV (P-IV), full-text-based analysis was applied to 72 articles, and a total of 41 articles were found most pertinent and finalized for inclusion in this SLR for data extraction and analysis. Highly recognized digital libraries (DL) that publish research studies for various journals, conferences, and workshops were employed to select studies for this systematic literature review, as per the search strategy shown in Table 3. Figure 3 shows the DL-wise distribution ratio of the selected articles, with PUBMED top-ranked at a 41% share, PMC at 15%, PLOS ONE at 12%, Oxford Academic and ACM at 10% each, IEEEXplore and Science Direct at 5% each, and Springer Link at 2%. The publisher-based stage-wise selection status and distribution ratio of the selected studies have already been shown in Table 5 above.

B. QUALITY ASSESSMENT SCORE
As per the scoring mechanism defined above, scores were assigned to each selected study, evaluated according to the internal and external criteria mentioned in Table 6. The scores obtained through the internal and external criteria are represented as I-Score and E-Score, respectively, and the publication type as P. Type. Of all the selected articles, 32% received a score greater than 8 and were categorized as high-ranked articles, 58% obtained an average rank, while merely 4 out of 41 articles (10%) fell into the low rank according to the criteria. These facts are represented in Figure 4, demonstrating high trust in the quality of the selected articles. However, no elimination was performed on a quality basis.

C. ASSESSMENT AND DISCUSSION OF RESEARCH QUESTIONS
In this part, the forty-one main articles were analyzed on the basis of the research questions described in Table 1. The facts extracted after the analysis of the selected studies are discussed question by question for the evaluation of piecemeal information. Data collection is the primary task in building a prediction model based on ML methods. The major data sources identified during the detailed review of the selected articles are listed in Table 7. Most of the articles use multiple data sources to collect their relevant datasets. A domain-wise listing of the data sources used in the selected articles is given in Table 8. Due to the limitations of the available data, most of the research has been conducted by collecting data from previous studies: sixty percent of the included studies use the datasets of previously published articles, combined with various online resources, according to the facts shown in Table 8. The APD, the ''Antimicrobial Peptide Database'', initially appeared in 2003 and expanded into APD3 through regular updates; it comprises 185 anti-cancer, 80 anti-parasitic, 172 anti-viral, 959 anti-fungal, 105 anti-HIV, and 2,169 antibacterial peptides [71]-[73]. CAMP stands for the ''Collection of Antimicrobial Peptides Database'', freely available online and established to improve the current understanding of antimicrobial peptides. It is manually composed and currently holds 3,782 sequences of antimicrobial peptides; of these, 2,766 were experimentally tested and 1,016 predicted, according to their reference material [38], [74]. The ''DADP (Database of Anuran Defense Peptides)'' was manually built and presently comprises 2,571 entries, of which 921 sequences were mainly collected from previously published articles as well as from UniProt [75].
The ''Swiss Institute of Bioinformatics (SIB)'', the ''Protein Information Resource (PIR)'', and the ''European Bioinformatics Institute (EBI)'' collaboratively host the UniProt consortium, which aims to provide a valuable protein knowledge base. It provides a free and convenient way to access primary protein sequences and the necessary annotations. Its core features include a user-friendly website, sequence archiving, manually curated protein sequences, and cross-references to other knowledge bases [76].
Swiss-Prot is also part of the UniProt knowledge base, containing analyzed and manually annotated sequences [77], [78]. The ''Protein Data Bank (PDB)'' is a primary source of data used in various online data sources, such as UniProt, providing a firm base for research in bioinformatics with approximately 159,670 structures [79]. The ''National Center for Biotechnology Information (NCBI)'' is another large source of data on nucleic acid sequences, molecular modeling, and protein clusters, e.g., GenBank [80].
The ''Immune Epitope Database (IEDB)'' lists almost 120,000 experimentally validated structures investigated in human and non-human species. Researchers can easily access the data online through the provided web interface; see Table 7. The IEDB also contains tools to help predict and analyze sequences [81]. As claimed in [82], CancerPPD contains 3,491 anti-cancer peptides and 121 protein entries with comprehensive information. It is manually curated, and the data are gathered from other databases, patents, and previously published research papers.

3) ASSESSMENT OF QUESTION 3. WHICH FEATURE ENCODING SCHEMES ARE BEING USED TO TRAIN THE COMPUTATIONAL PREDICTION MODELS?
Peptides are diverse-length sequences of amino acids containing biological information [16], while machine learning methods require feature vectors of fixed length [83]. Accordingly, the first step should be the fixed-length numerical formulation of these non-numeric sequences. Over time, various feature encoding schemes have been developed for this purpose. These schemes are designed to precisely capture the sequential knowledge and exhibit associations within the peptide sequences [29]. The sequence of a peptide can be mathematically expressed as:

P = p_1 p_2 p_3 · · · p_N (2)

where p_1 represents the first amino acid residue, p_2 the second residue, . . ., and p_N the last amino acid residue in the peptide sequence P, and N denotes the sequence length. Each of the amino acid residues represented in (2) belongs to the set of 20 natural amino acids. The feature encoding schemes used in the selected articles are summarized in Table 9. Common supporting feature encoding tools that may be used to encode several types of features are detailed at the end of this section; distinctive tools are mentioned within the descriptions of the corresponding feature encodings.

a: AMINO ACID COMPOSITION (AAC)
Peptides are sequences of amino acids, and AAC is their compositional representation, in which the relative occurrence of each individual amino acid in a sequence is determined. It produces a twenty-dimensional (20-D) numeric vector in which each component is the count of one type of amino acid normalized by the total number of amino acids in the provided peptide sequence [56], [84], and can be calculated as:

f_i = N_i / L, i = 1, 2, · · · , 20

where L is the sequence length, i indexes the 20 natural amino acids, and N_i is the number of amino acids of type i. As AAC gives the occurrence frequencies in a given peptide, it helps to analyze the compositional features of a class of peptides and to distinguish one class from another on the basis of this information [13]. For example, the amino acids Gln, Tyr, Ser, Arg, and Leu were found in abundance in anti-inflammatory peptides, while in non-anti-inflammatory peptides Thr, Asp, Pro, Ala, and Gly were found to a greater extent [13].
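As an illustration, the AAC formula above can be sketched in a few lines of Python (the function name, and the alphabetical ordering of the 20 amino acids, are our own illustrative choices, not taken from any cited tool):

```python
from collections import Counter

# The 20 natural amino acids in alphabetical one-letter-code order
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aac(sequence):
    """20-D amino acid composition: f_i = N_i / L for each amino acid type i."""
    counts = Counter(sequence)
    L = len(sequence)
    return [counts.get(aa, 0) / L for aa in AMINO_ACIDS]
```

Because each component is a count divided by the sequence length, the 20 components always sum to 1, regardless of peptide length.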

b: DIPEPTIDE COMPOSITION (DC)
DC represents the percentage of all the possible amino acid pairs in a particular peptide. As 20 diverse amino acids exist in nature, 400 (20²) possible pairs may exist, accordingly producing a feature vector of size 400. DC carries information regarding the adjacent pairs (neighboring residues) of amino acids in a peptide sequence [13], [59], [84]. The ratio of a dipeptide may vary from one class to another; thereby, this kind of extracted information may support peptide type specification. For instance, anti-inflammatory peptides were found to be enriched in different amino acid pairs compared to non-anti-inflammatory peptides [13].
Equation (3) represents the mathematical formulation used to calculate the DC feature vector [59]:

DC(i, j) = D_ij / Σ_{i=1}^{20} Σ_{j=1}^{20} D_ij (3)

where D_ij represents the number of dipeptides with the specific amino acid pair i and j, and Σ_{i=1}^{20} Σ_{j=1}^{20} D_ij represents all possible dipeptides in the given peptide sequence.

c: TRIPEPTIDE COMPOSITION (TC)
Similar to DC, TC measures the rate of all potential tripeptides that a sequence may contain. TC is used to transform an amino acid sequence of varying size into a fixed-size vector of 8,000 (20³) dimensions [13].
Like other composition-based features, it has been observed that different peptide classes may have different frequencies of tripeptide combinations, so TC may also be helpful in peptide classification or recognition [13]. The calculation of TC is similar to that of dipeptide composition; the only difference is that it takes into account the information of three consecutive residues [59], and it can be expressed as:

TC(i, j, k) = T_ijk / Σ_{i=1}^{20} Σ_{j=1}^{20} Σ_{k=1}^{20} T_ijk

where T_ijk represents the number of tripeptides with the specific adjacent amino acid residues i, j, and k, and Σ_{i=1}^{20} Σ_{j=1}^{20} Σ_{k=1}^{20} T_ijk represents all possible tripeptides in the given peptide sequence.
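Both DC and TC are instances of a general k-mer composition (k = 2 gives the 400-D DC vector, k = 3 the 8,000-D TC vector). A minimal sketch, with the function name and the k-mer ordering as our own illustrative choices:

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 natural amino acids

def kmer_composition(sequence, k=2):
    """Normalized frequency of every possible k-mer over the 20-letter
    amino acid alphabet: 400-D for k=2 (DC), 8000-D for k=3 (TC)."""
    all_kmers = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    total = max(len(sequence) - k + 1, 1)  # number of k-length windows
    counts = {}
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        counts[kmer] = counts.get(kmer, 0) + 1
    return [counts.get(m, 0) / total for m in all_kmers]
```

The resulting vector is sparse for short peptides (a length-30 peptide can populate at most 28 of the 8,000 TC components), which is one practical reason composition features are often combined with other encodings.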

d: G-GAP DIPEPTIDE COMPOSITION (GDC)
GDC is an extension of the DC defined above; it numerically represents the relationship of adjoining amino acids (AA_i, AA_{i+1}) in a given sequence, whereas the DC descriptor is unable to encapsulate similar information about non-adjacent amino acids (AA_i, AA_j; j − i > 1). However, more essential properties may lie in higher-level correlations of amino acid residues. Non-adjacent amino acid residues that are found spatially closer in the tertiary structure reveal greater biological significance than adjoining residues. In particular, in some common secondary structures such as beta-sheets and alpha-helices, two non-adjacent residues may be spatially close and connected by hydrogen bonds [85]. Thus, non-adjacent correlations may be useful for the type categorization of a peptide. Therefore, GDC was introduced: a 400-dimensional descriptor capable of encompassing information on the relational arrangement between non-adjacent amino acids in the sequence of a given peptide [16]. The GDC descriptor can be expressed as:

GDC = [f_1^g, f_2^g, · · · , f_400^g]

where f_i^g represents the frequency of the i-th g-gap dipeptide in a given peptide sequence, with i = 1, 2, 3, · · · , 400, and f_i^g is evaluated as:

f_i^g = N_i^g / Σ_{j=1}^{400} N_j^g

where N_i^g is the number of occurrences of the i-th g-gap dipeptide in the sequence.
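A minimal sketch of the g-gap dipeptide count, assuming the convention that a g-gap dipeptide is a pair of residues separated by exactly g intervening positions (so g = 0 reduces to ordinary DC); the function name and pair ordering are illustrative:

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def gdc(sequence, g=1):
    """400-D g-gap dipeptide composition: counts pairs (p_i, p_{i+g+1})
    separated by g residues, normalized by the total number of such pairs."""
    pairs = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
    total = max(len(sequence) - g - 1, 1)  # number of g-gap pairs in the sequence
    counts = {}
    for i in range(len(sequence) - g - 1):
        pair = sequence[i] + sequence[i + g + 1]
        counts[pair] = counts.get(pair, 0) + 1
    return [counts.get(p, 0) / total for p in pairs]
```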

e: ADAPTIVE SKIP DIPEPTIDE COMPOSITION (ASDC)
Wei et al. proposed ASDC, a modified version of the dipeptide composition that retains the correlation information between adjacent and intervening amino acids in a peptide sequence [86]. Therefore, ASDC carries the properties of both DC and GDC, and is a 400-dimensional feature vector [61]. The ASDC feature descriptor for a given peptide sequence is expressed as:

ASDC = (v_1, v_2, · · · , v_400)

and v_i is evaluated as:

v_i = Σ_g O_i^g / Σ_{i=1}^{400} Σ_g O_i^g

where O_i^g denotes the number of occurrences of the i-th dipeptide with g intervening residues, so v_i contains the occurrence frequency of all potential pairs of amino acid residues with an intervening distance less than or equal to L − 1.
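Under the interpretation above (pool every ordered residue pair (p_i, p_j) with i < j, i.e. every possible skip, into one 400-D histogram), ASDC can be sketched as follows; the function name is our own and the pooling interpretation is an assumption based on the description in the text:

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def asdc(sequence):
    """400-D adaptive skip dipeptide composition: every ordered residue pair
    (p_i, p_j) with i < j, i.e. all skip distances up to L-1, pooled together
    and normalized by the total number of such pairs."""
    pairs = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
    counts, total = {}, 0
    L = len(sequence)
    for i in range(L - 1):
        for j in range(i + 1, L):
            pair = sequence[i] + sequence[j]
            counts[pair] = counts.get(pair, 0) + 1
            total += 1
    return [counts.get(p, 0) / max(total, 1) for p in pairs]
```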

f: K-SPACED AMINO ACID PAIRS (KSAAP)
KSAAP is a widely used feature descriptor in bioinformatics studies. The KSAAP descriptor is used to obtain the compositional information of amino acid pairs from a peptide sequence while varying the space (K) between the residues of each pair [54], [87]. A peptide sequence has 400 (20 × 20) types of different amino acid pairs, for instance (AA, AB, AC, . . ., YY)_400 when K = 0. KSAAP can be expressed as [88]:

f(p_i, p_j) = N_{(p_i, p_j)}^K / (L − K − 1)

where K represents the space between the paired amino acids p_i and p_j, N_{(p_i, p_j)}^K counts the occurrences of that K-spaced pair, and L − K − 1 is the total number of K-spaced pairs in a sequence of length L. KSAAP behaves like a 400-dimensional dipeptide composition of 20 × 20 pairs when K equals 0, while K can be adjusted to any optimal value to obtain useful ordered information about amino acid pairs. As discussed above, this kind of compositional information may be useful for categorizing different sequences. Refer to Hasan et al. [88] for the further detailed process. KSAAP encoding can be computed using iFeature, available both as a standalone tool and as a web server [89].
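A hedged sketch of the concatenated variant commonly called CKSAAP, which stacks one 400-D block per gap value k = 0..k_max (the stacking layout and function name are illustrative; the cited tools may order the blocks differently):

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def cksaap(sequence, k_max=2):
    """Concatenated k-spaced amino acid pair composition for k = 0..k_max;
    each k contributes a 400-D block (k=0 is the ordinary DC block)."""
    pairs = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
    features = []
    for k in range(k_max + 1):
        total = max(len(sequence) - k - 1, 1)  # number of k-spaced pairs
        counts = {}
        for i in range(len(sequence) - k - 1):
            pair = sequence[i] + sequence[i + k + 1]
            counts[pair] = counts.get(pair, 0) + 1
        features.extend(counts.get(p, 0) / total for p in pairs)
    return features
```

With k_max = 2 this yields a 1,200-D vector; each 400-D block sums to 1 on its own.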

g: PSEUDO-AMINO ACID COMPOSITION (PSE-AAC)
AAC, DC, and TC have been practiced widely in the field of bioinformatics for the classification or prediction of several proteins/peptides, but these compositions may lose the sequence order information. To address this issue, Pse-AAC was proposed [49], [90]. Unlike the previously described compositions, Pse-AAC takes into account the sequence order information of the amino acids. Pse-AAC rapidly gained increased use in computational proteomics and bioinformatics [83] after it was proposed by Chou [91]. Pse-AAC measures the correlation among different ranges of amino acid pairs, which may vary across different classes of peptides. According to Chou's concept, Pse-AAC can be expressed as [92]:

P = [x_1, x_2, · · · , x_20, x_{20+1}, · · · , x_{20+λ}]^T

where the first 20 components represent the composition of the 20 naturally occurring amino acids, similar to AAC, the components 20+1 to 20+λ are additional components describing the correlations and sequence order information of the amino acids via several modes along the length of the peptide sequence, and T is the transpose operator. Equation (10) underlines all the modes of Pse-AAC as [92], [69]:

x_u = f_u / (Σ_{i=1}^{20} f_i + Σ_{j=1}^{λ} w_j p_j), for 1 ≤ u ≤ 20
x_u = w_{u−20} p_{u−20} / (Σ_{i=1}^{20} f_i + Σ_{j=1}^{λ} w_j p_j), for 20+1 ≤ u ≤ 20+λ (10)

where f_i (i = 1, 2, · · · , 20) is the same as in the AAC representation, p_j (j = 1, 2, · · · , λ) represents the pseudo amino acid components, and w_j stands for the weight factors. Refer to [49], [90]-[92] for further detailed descriptions. Currently, three compelling tools exist: ''propy'' [93] and ''Pse-AAC-Builder'' [94] calculate ''Pse-AAC''-based features in several forms [90], while ''Pse-AAC-General'' [95] has been developed to derive general features based on ''Pse-AAC'' [92]. Furthermore, two highly robust, effective, and publicly available servers, ''Pse-in-One'' [96] and its next variant ''Pse-in-One 2.0'' [97], have been implemented to extract features from given RNA, DNA, and peptide/protein sequences according to requirements. Refer to [91] and [90] for additional details and comprehension.
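A simplified single-property sketch of Pse-AAC, using the Kyte-Doolittle hydrophobicity scale as the assumed property, a squared-difference correlation function, and a single shared weight w (all assumptions on our part; real implementations such as Pse-in-One average several normalized physicochemical properties):

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydrophobicity values (assumed example property)
HYDRO = {"A": 1.8, "C": 2.5, "D": -3.5, "E": -3.5, "F": 2.8, "G": -0.4,
         "H": -3.2, "I": 4.5, "K": -3.9, "L": 3.8, "M": 1.9, "N": -3.5,
         "P": -1.6, "Q": -3.5, "R": -4.5, "S": -0.8, "T": -0.7, "V": 4.2,
         "W": -0.9, "Y": -1.3}

def pse_aac(sequence, lam=3, w=0.05):
    """Simplified single-property Pse-AAC: a (20 + lam)-D vector.
    Requires len(sequence) > lam."""
    L = len(sequence)
    counts = Counter(sequence)
    f = [counts.get(aa, 0) / L for aa in AMINO_ACIDS]  # plain AAC part
    theta = []
    for j in range(1, lam + 1):
        # j-tier correlation: mean squared property difference at distance j
        t = sum((HYDRO[sequence[i]] - HYDRO[sequence[i + j]]) ** 2
                for i in range(L - j)) / (L - j)
        theta.append(t)
    denom = sum(f) + w * sum(theta)
    return [x / denom for x in f] + [w * t / denom for t in theta]
```

Because the composition part sums to 1 before normalization, the full (20 + λ)-D vector always sums to 1, mirroring the shared denominator in Equation (10).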

h: AMPHIPHILIC PSEUDO-AMINO ACID COMPOSITION (AM-PSE-AAC)
Various kinds of peptides have characteristics like lipophilic and hydrophobic properties, commonly called the amphiphilic (AM) property. This AM property plays a central part in protein folding, which determines a protein's interaction and functional activity. To utilize these important features, Chou introduced a new scheme, named Am-Pse-AAC, into the array of encodings [56], [98]. Similar to Pse-AAC, Am-Pse-AAC also takes into account the order information of peptide sequences. Am-Pse-AAC is a (20 + 2λ)-dimensional feature vector, where the first 20 components represent the composition of the 20 natural amino acids, whereas the remaining 2λ numbers describe and determine the correlation factors of the amphiphilic property distribution in a peptide/protein. It can be formulated as [56]:

P = [p_1, p_2, · · · , p_20, p_{20+1}, · · · , p_{20+λ}, p_{20+λ+1}, · · · , p_{20+2λ}]^T

where p_1, p_2, · · · , p_20 are the amino acid compositions, p_{20+1}, · · · , p_{20+2λ} are the sequence correlation factors, and T is the transpose operator. Refer to [98] for further details and the mathematical calculation.

i: COMPOSITION-TRANSITION-DISTRIBUTION (CTD)
As the name suggests, the CTD feature descriptor comprises three different attributes (composition, transition, and distribution) to represent the global description of a given peptide sequence [99], [100]. The design goal of CTD is to derive significant biological information from the amino acid sequences. For feature description, the amino acids of a peptide sequence are divided into groups according to their property types and numerically encoded as 1, 2, . . ., n according to the group to which each belongs. These properties may be charge, hydrophobicity, secondary structure, polarity, polarizability, solvent accessibility, normalized van der Waals volume, etc. [16], [101]. On the basis of these groups, Dubchak et al. [102] proposed CTD to encode features based on the global composition, sequence order, and physicochemical properties of amino acid sequences to predict different protein folding classes. Afterward, CTD has been applied in several sequence-based classification models [12], [87].
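To make the three attributes concrete, here is a stdlib-only sketch (the three-group hydrophobicity partition in the usage example is one common choice; actual group definitions vary by property and tool):

```python
import math

def ctd(seq, groups):
    """Composition, Transition, Distribution for one property grouping."""
    n = len(groups)
    label = {a: g for g, members in enumerate(groups) for a in members}
    s = [label[a] for a in seq]
    L = len(s)
    # C: frequency of each group in the sequence
    C = [s.count(g) / L for g in range(n)]
    # T: frequency of adjacent residues drawn from two different groups
    T = [sum(1 for a, b in zip(s, s[1:]) if {a, b} == {i, j}) / (L - 1)
         for i in range(n) for j in range(i + 1, n)]
    # D: relative positions of the first, 25%, 50%, 75%, 100% residues per group
    D = []
    for g in range(n):
        pos = [i + 1 for i, x in enumerate(s) if x == g]
        for frac in (0, 0.25, 0.5, 0.75, 1.0):
            if not pos:
                D.append(0.0)
            else:
                idx = max(1, math.ceil(frac * len(pos)))
                D.append(pos[idx - 1] / L)
    return C, T, D

# one commonly used hydrophobicity grouping (polar / neutral / hydrophobic)
HYDRO_GROUPS = ("RKEDQN", "GASTPHY", "CLVIMFW")
C, T, D = ctd("RRGGCC", HYDRO_GROUPS)
```

For n groups this yields n composition values, n(n−1)/2 transition values, and 5n distribution values.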
In CTD, the C (composition) computes the frequency of every amino acid group with some particular property in a peptide sequence and can be defined as [52]:

C(i) = a_i / L, i ∈ {1, 2, · · · , n}

where a_i is the number of residues belonging to group i, n is the total number of groups, and L is the length of the sequence. The T (transition) describes the occurrence of amino acids of a particular property accompanied by amino acids of another property, which can be expressed as [52]:

T(i, j) = (a_{i,j} + a_{j,i}) / (L − 1), i, j ∈ {1, 2, · · · , n}, i ≠ j

where i and j represent the corresponding property groups, n is the total number of groups, L is the length of the sequence, and a_{i,j} represents the number of dipeptides whose amino acid residues belong to the two different groups. Lastly, the D (distribution) measures the distribution of all amino acids belonging to a particular property group along the sequence length, measured by the locations of the first, 25%, 50%, 75%, and 100% of that group's residues [99], [102]. It can be formulated as [52]:

D = (a_{1,1}/L, · · · , a_{1,5}/L, a_{2,1}/L, · · · , a_{2,5}/L, · · · , a_{n,1}/L, · · · , a_{n,5}/L)

where a_{i,1} is the position in the sequence at which the first residue of the i-th group is placed. Similarly, a_{i,2}, a_{i,3}, a_{i,4}, and a_{i,5} are the positions within the sequence length at which 25%, 50%, 75%, and 100% of the i-th group's residues are contained, respectively. Refer to [52] and [101] for additional details.

j: PHYSICOCHEMICAL PROPERTIES (PCP)
PCP is one of the most intuitive descriptors related to physicochemical properties, where the residues of a peptide are translated into a specific property class. Numerous bio-chemical and bio-physical properties have been determined experimentally. Amino acid sequence based properties alone are sometimes insufficient to categorize peptide sequences. In such cases, a blend of several physicochemical and structural features like amino acid composition, amphiphilic and hydrophilic properties, size, overall charge, and secondary structure is useful to represent the inclusive characteristics of a peptide [36].
For instance, the prediction of AMPs has been carried out using physicochemical properties, with good results, by Torrent et al. [42]. Various tools have been established to obtain physicochemical properties. Kawashima et al. [103] established the amino acid index (AAI) database containing bio-chemical and bio-physical properties as numerical indices [29]. The AAI is classified into three divisions: AAindex1 (AAI-1), AAindex2 (AAI-2), and AAindex3 (AAI-3). AAI-1 comprises biochemical characteristics of the 20 natural amino acids that can be represented as numerical values, referred to as amino acid indices. Currently, AAI-1 has 544 amino acid indices. AAI-2 provides aggregates of various substitution matrices like BLOSUM62 or PAM26 [29] and currently has 94 substitution matrices: 67 symmetric and 27 non-symmetric. The last section, AAI-3, provides protein contact potentials [103], i.e., empirical values for spatially adjacent amino acids, like Gibbs energy change, to specify the interactions among amino acid pairs [29]. AAI-3 currently has 47 entries, with 44 symmetric and 3 non-symmetric amino acid contact potential matrices [103].

k: MOTIF BASED FEATURES (MBF)
A motif is a repeated pattern of sequences in a collection of proteins/peptides. Previously, various bio-informatics articles have examined motifs pertaining to immunological characteristics in peptides for several purposes [104]–[106]. The repeated patterns (motifs) in amino acids having specific properties are responsible for the interaction of peptides. Therefore, the occurrence of these particular repeated patterns can be used in the subject-wise classification and identification of peptide types [13]. Identifying positive-dataset motifs by comparing them with motifs of the negative dataset provides high-class motifs for positive dataset characterization. Numerous publicly accessible tools have been developed to determine motifs, e.g., MERCI [107] and MEME [108]. MERCI uses physicochemical properties to identify motifs for a given class of amino acids. It provides functionally relevant, recurrent, and high-class motifs of the positive dataset which are not available in the negative samples [13], [107]. MEME-Suite is an integrated platform with a web interface providing multiple functions like database searching (motif-sequence and motif-motif) and, primarily, motif detection [36], [108].

l: SEQUENCE ALIGNMENT BASED FEATURES
Sequence alignment in bio-informatics is a method of organizing nucleotide or amino acid sequences to recognize similar areas in the sequences [109]. This similarity may be due to evolutionary relationships or based on structural or functional associations. The sequences are presented as rows of a matrix, where spaces are inserted within the residues so that columns may be aligned based on similar characteristics. The segments of a sequence that have high similarity are expected to have relatively similar structural and functional activities [110]. For example, a peptide sequence P is considered to share the same characteristics as Pk in a given set of peptides (P1, P2, . . ., Pn) if P obtains a high-scoring segment pair (HSP) score between the query peptide P and the peptide set.
This similarity measure can be obtained by using several well-defined tools [39].
Various sophisticated alignment-based tools have been created, like BLAST [111], BLASTP [112], HMMER [113], FASTA [114], PSI-BLAST [115], and the Smith-Waterman algorithm [116]. These sequence alignment programs have been extensively practiced for protein/peptide prediction [36], [39], [110], [117]. Alignment-based approaches are fairly simple and straightforward; however, they fail to perform well with a peptide sequence that does not have any substantial similarity to known peptides [60].

m: CHAOS GAME REPRESENTATION (CGR)
CGR was suggested by Jeffrey in 1990, specifically to represent DNA sequences [118], and various generalized variants of CGR have been reported to represent amino acid (protein/peptide) sequences, both in numerical and graphical form, for their analysis. Distinct CGRs are produced for different sequences owing to the injective behavior of the iterated function, which is quite an important characteristic of CGR [57]. It has been observed that different kinds of proteins/peptides demonstrate specific CGR patterns. This CGR feature has established its effective use in the identification of new members of a specific peptide type. CGR has a further benefit over the sequence alignment tools usually used to search for sequence homology: CGR has a great potential to unveil the functional and evolutionary associations among proteins/peptides that have no substantial sequence similarity (homology) [119], [120]. C-GRex [121] is a standalone tool available to derive CGR features.
The sequence mapping with CGR is invertible; therefore, the original sequences can be recovered. CGR can be mathematically represented by an iterated function as [57]:

x_0 = (0, 0) and x_n = (x_{n−1} + ν_n)/2 (17)

where ν_n represents the vertex related to the n-th base.
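A minimal sketch of the iterated map in Eq. (17), using the classic DNA unit-square layout (the four corner coordinates are the conventional choice for DNA; peptide CGR variants use generalized vertex sets):

```python
def cgr_points(seq, vertices):
    """Iterate x_n = (x_{n-1} + v_n) / 2 from x_0 = (0, 0)."""
    x, y = 0.0, 0.0
    pts = []
    for ch in seq:
        vx, vy = vertices[ch]
        x, y = (x + vx) / 2, (y + vy) / 2  # move halfway toward the vertex
        pts.append((x, y))
    return pts

# classic DNA example with the four bases at the unit-square corners
DNA_VERTICES = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}
pts = cgr_points("ACGT", DNA_VERTICES)
```

Because each step halves the distance to a vertex, every prefix of the sequence maps to a unique point, which is what makes the mapping invertible.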

n: DISCRETE FOURIER TRANSFORMATION (DFT)
DFT is used to discover periodic patterns, and their relative strength, by transforming numerical data into the frequency domain. In bioinformatics, the DFT has been applied successfully to tasks like image processing, DNA hierarchical analysis, gene prediction, identification of coding regions, and protein/peptide sequence analysis and prediction. It is successfully applied to DNA, protein, and peptide sequences because the DFT spectrum of a sequence represents the distribution and periodic patterns within it [122].
The amino acid sequences of proteins or peptides are first transformed into numerical representations, usually using physicochemical properties like hydrophobicity, hydrophilicity, etc. This numerical information is then transformed into its corresponding frequency domain using the discrete Fourier transformation (DFT) to obtain discrete components, which may reveal hidden significant features of the sequences without loss of information [52], [123], [124]. The frequency domain shows the periodicity and power distribution of the discrete components contained in the peptide sequence over frequencies [52]. The DFT of a peptide sequence at frequency k can be calculated as:

f(k) = Σ_{n=0}^{L−1} H(P_n) e^{−i2πnk/L}, k = 0, 1, 2, · · · , L − 1 (18)

where f(k) denotes the compositional patterns and periodic properties of the sequence by sinusoidal components with various frequencies, and H(P_n) represents the value of a physicochemical property for each amino acid in the given peptide sequence of length L. The power spectrum of the DFT at frequency k can be calculated as:

PS(k) = |f(k)|^2 (19)
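The transform and its power spectrum can be sketched directly with the standard library (the function name is ours; in practice the input values would come from a chosen physicochemical scale such as hydrophobicity):

```python
import cmath

def dft_power(values):
    """Power spectrum |f(k)|^2 of a numerically encoded peptide sequence."""
    L = len(values)
    power = []
    for k in range(L):
        fk = sum(h * cmath.exp(-2j * cmath.pi * n * k / L)
                 for n, h in enumerate(values))
        power.append(abs(fk) ** 2)
    return power

# toy example: a perfectly period-2 "property signal" concentrates its
# power at frequencies 0 and L/2
spectrum = dft_power([1.0, 0.0, 1.0, 0.0])
```

The concentrated peaks illustrate how periodic patterns in the encoded sequence show up as power at specific frequencies.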

p: SECONDARY STRUCTURE BASED ENCODINGS (SSE)
Several studies have reported the use of structural features for the classification of proteins or peptides [40], [51]. The simplest structure of a protein or peptide, termed the primary structure, consists of the sequential order of amino acids, while the secondary structure is a higher-level representation of the protein or peptide. Secondary structures are highly associated with the functional activity of proteins or peptides; for instance, the alpha-helix and beta-sheet are two core categories of secondary structure. The combined use of both primary and secondary structure-based descriptors enhances the overall accuracy by improving the generalization capabilities of the classifier [29]. Numerous tools are available to predict the secondary structure from a given amino acid sequence, such as SPIDER2 [130], PEP2D (http://crdd.osdd.net/raghava/pep2d/), and Protein secondary structure prediction (PSSpred) [51]. Khatun et al. used SPIDER2 and PEP2D to obtain structure-based descriptors such as alpha-helix, beta-sheet, and random coil, along with accessible surface area (ASA) and beta-torsion angles [54].

q: PROTEIN RELATEDNESS MEASURE (PRM)
Currently, the sequence alignment methods used to measure protein similarity are usually slow and require assumptions about evolutionary mechanisms and relationships. Carr et al. proposed PRM, which is both fast and useful in revealing the functional and phylogenetic connections among protein/peptide sequences [131].
The PRM property is used to measure the degree of deviation of amino acids in a specific sequence compared to a theoretical protein/peptide [63]. The PRM descriptor comprises three measures: Type I (compositional), Type II (centroidal), and Type III (distributional). The first estimates the degree of deviation from the expected proportion of each amino acid in the sequence. The second estimates the presence of particular amino acids in specific areas of a sequence. The last estimates the proportion in which amino acid groups occur along the length of the amino acid sequence [63], [132]. PRM is a 60-dimensional feature vector containing 20 dimensions for each measure. Refer to [131] for the detailed calculation process and source code in C++ and Mathematica to encode the PRM feature.

r: BINARY PROFILE FEATURE (BPF)
Proteins/peptides are based on combinations of several amino acids, which signify their functional activities. In this descriptor, each residue is encoded in binary form (0/1) according to its amino acid type out of the twenty natural amino acids already mentioned above. The presence of an amino acid is coded as a function of each amino acid in ordered form as f(A) = (1, 0, . . ., 0) for the first type of amino acid, ''A'', f(C) = (0, 1, . . ., 0) for the second type, ''C'', and so on [64]. For a certain peptide sequence with amino acid length k, BPF can be encoded as [16]:

BPF(P) = [f(a_1), f(a_2), · · · , f(a_k)]

where k is the length of the peptide sequence as a whole, or can be specified in the form of the N-terminus (amino-terminus/start of the amino acid chain) or C-terminus (carboxyl-terminus/end of the amino acid chain) of the protein or peptide sequence [16], [64]. Therefore, BPF is a 20 × k dimensional descriptor. Numerous common tools or web servers are available to encode several features from given protein/peptide sequences. For instance, any one of propy [93], iLearn [133], iFeature [89], Pfeature [134], Protr [135], PyBioMed [136], PseAAC-General, Pse-in-One, and Pse-in-One 2.0 can be used to obtain feature descriptors of type AAC, DC, TC, Pse-AAC, Am-Pse-AAC, and CTD, whereas BPF can be derived using any one of iFeature, Pfeature, Pse-in-One, and Pse-in-One 2.0.
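A one-hot sketch of the binary profile (the alphabetical ordering is our choice; truncation via the optional k argument mirrors the N-/C-terminus usage described above):

```python
AA = "ACDEFGHIKLMNPQRSTVWY"

def bpf(seq, k=None):
    """Binary profile: one 20-dim one-hot block per residue (20 x k values)."""
    residues = seq[:k] if k is not None else seq
    vec = []
    for ch in residues:
        block = [0] * 20
        block[AA.index(ch)] = 1  # mark the residue's amino acid type
        vec.extend(block)
    return vec

v = bpf("AC")   # two residues -> 40 values, exactly two of them set to 1
```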

4) ASSESSMENT OF QUESTION 4. WHAT ARE THE MAJOR ML CLASSIFIERS USED IN PEPTIDE PREDICTION MODELS?
In the answer to the previous question, we extensively elaborated the feature encoding schemes. Likewise, in this section we elaborate the various ML classifiers, conventional as well as advanced, used in the field of study. Numerous statistical or artificial intelligence (AI) based algorithms, such as logistic regression and ML-based methods, have been used in the literature to build classifiers for the prediction or classification of peptides and proteins to explore their functional activities [137].
These include machine learning algorithms such as Support Vector Machine (SVM) [55], [57] and Random Forest (RF) [12], [38], as well as advanced machine learning methods like Neural Networks (NN) [42], [50] and Deep Neural Networks (DNN) [46], [64]. All the classifiers that have been identified in this study are shown in Table 10 along with their references. SVM has been identified as the most widely used algorithm, whereas RF is the second-most used classifier. However, the combinatorial use of different classifiers, applied as ensemble classifiers, has also been reported.

a: SUPPORT VECTOR MACHINE (SVM)
SVM is an ML algorithm that works in supervised mode: it is trained with a labeled dataset to produce a model which can then be utilized to determine the correct label for unlabeled data [16]. SVM follows a statistics-based learning concept; it was originally used for twofold (binary) classification and was then adapted successfully to multiclass problems [66]. SVM utilizes the basic concept of a hyperplane to maximize the separation margin among the classes of the dataset to be classified. The optimal separation is achieved by converting the inputs into a higher-dimensional feature space [67].
In practice, however, the proposed feature set is often not linearly separable. This issue is solved by using kernels to map the primary non-separable characteristics into a linearly separable feature set, where optimal hyperplane classification is possible. Four classes of kernel functions are often employed for this purpose: the linear function, radial basis function (RBF), polynomial function, and sigmoid function [16], [67]. The most commonly used function in the classification of various therapeutic peptides is the RBF.
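The RBF kernel itself is a one-liner; a stdlib sketch:

```python
import math

def rbf_kernel(x, y, gamma=0.5):
    """K(x, y) = exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)
```

Identical feature vectors yield K = 1, and the similarity decays smoothly toward 0 as the vectors move apart, with gamma controlling the decay rate.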

b: RANDOM FOREST (RF)
The RF model uses two powerful (regularization) procedures, bootstrap aggregation (bagging) and random selection of features, and is therefore also considered an ensemble classifier that works in supervised mode. It is a widely practiced algorithm for solving problems related to regression as well as classification. The RF model is built by growing decision trees using a split-sampling technique with a replacement policy (bootstrap sampling). The training dataset splits into two parts: one is used for model training, while the remaining portion (usually one third of the whole dataset) is used to test the model. This approach, known as the out-of-bag (OOB) approach, offers better prediction performance and improves the reliability of the model for a dataset [3], [45].
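The "roughly one third left out" behavior of bootstrap sampling can be checked empirically (a hypothetical simulation of one bootstrap draw, not part of any cited predictor):

```python
import random

def oob_fraction(n, seed=42):
    """Fraction of n samples absent from one bootstrap draw of size n."""
    rng = random.Random(seed)
    in_bag = {rng.randrange(n) for _ in range(n)}  # draw n times with replacement
    return 1 - len(in_bag) / n

frac = oob_fraction(100_000)  # approaches 1/e (about 0.368) for large n
```

Each sample is missed by one draw with probability (1 − 1/n)^n, which converges to 1/e, hence the familiar "about one third" OOB portion.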

c: ARTIFICIAL NEURAL NETWORKS (ANN)
The NN is another machine learning model providing artificial intelligence based solutions and is extensively practiced in the area of bioinformatics for pattern recognition, classification, etc. [138]. In general, an NN is a collection or web of artificial neurons or nodes, an artificial representation of the human brain, and is therefore usually termed an artificial neural network (ANN). These artificial neurons are interlinked in various ways to communicate with each other, arranged in connected layers [139]. The input layer is the first layer, where the number of neurons is related to the number of input variables. After the input layer there can be zero or more hidden layers, while the last layer is termed the output layer, where the number of neurons depends on the number of outputs, e.g., one for binary classification and more than one for multi-class labeling [139], [140]. The flow of information in an NN is as follows: the information patterns (inputs) are accepted at the input layer, which activates the neurons of the hidden layer(s) for outcome estimation; these outcomes are then passed to the output layer.
The neurons are computational/mathematical components that take inputs and pass results to other neurons after performing their computation. The links among neurons carry connection weights, where positive values indicate an exciting contact while negative values are considered inhibiting. An ANN has two phases: the learning (training) phase and the operating phase (capable of classifying patterns after being trained). In the learning phase, the NN is trained with given inputs and labeled data provided in advance; it transforms the data by multiplying each node value with the relevant link's weight, sums the outcomes at each node, adjusts the result with a node-specific bias value, and lastly produces the output by normalizing the resultant value with a function termed the ''activation function'' [138], [140]. The most commonly used activation functions are the ''sigmoid'', ''hyperbolic tangent (TanH)'', radial basis function (RBF), and ''rectified linear unit (ReLU)'' [37], [46], [64]. To avoid overfitting, dropout layers are generally implemented in NNs [46]. The feedforward neural network (FNN) [42], radial basis function network (RBFN), generalized regression neural network (GRNN), and probabilistic neural network (PNN) [70] are variants of the ANN used in the selected studies. The FNN has been used for several types of classification models. It is a simple neural network with an input layer, a sigmoid hidden layer, and an output layer; the flow of information is unidirectional, from input through the hidden layer and then to the output layer [42], [141]. Similar to the FNN, the RBFN also comprises three layers: the input layer, a hidden layer with an RBF (a non-linear activation function), and a linear output layer [52]. GRNN/PNN can be used for classification and decision-making problems; both have a similar architecture comprising four layers: the input, hidden, summation/pattern, and decision layers [70], [142], [143].
The input and hidden layers are functionally the same in both neural networks, while there is a difference in the numbers and functions of neurons in the summation/pattern and decision layers. For instance, the summation layer in the case of the GRNN has only two neurons, the denominator neuron and the numerator neuron, while the PNN has one neuron for each target class. Furthermore, the GRNN is used to implement regression problems that have continuous target variables [142], while the PNN handles classification with categorical target variables [143]. For further detailed descriptions refer to [142]–[145].
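The weighted-sum, bias, and sigmoid steps described above for one forward pass can be sketched as follows (layer sizes and weight values are arbitrary illustrations):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w_hidden, b_hidden, w_out, b_out):
    """One FNN forward pass: input -> sigmoid hidden layer -> sigmoid output."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + b)
              for ws, b in zip(w_hidden, b_hidden)]
    return sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)

# two inputs, two hidden neurons, one output neuron
y = forward([1.0, 0.0],
            w_hidden=[[0.5, -0.5], [0.3, 0.8]], b_hidden=[0.0, 0.1],
            w_out=[1.0, -1.0], b_out=0.2)
```

The sigmoid at the output squashes the result into (0, 1), which is why a single output neuron suffices for binary classification.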

d: DEEP NEURAL NETWORKS (DNN)
The DNN is an enhanced form of the ANN comprising multiple or a higher number of network layers, also known as the multilayer perceptron (MLP), and uses well-defined and improved algorithmic designs. The enhanced architecture of the DNN enables it to extract the best representative features of a given dataset without any manual involvement or guide plan. The adaptive neuro-fuzzy inference system (ANFIS) [50], convolutional neural networks (CNN) [46], and long short-term memory (LSTM) [37], [46], [64], a type of recurrent neural network (RNN), are some variants of deep neural networks used in the selected studies for their proposed prediction models, explained next.
The CNN is a neural network modeled on the basis of the human visual system and applied in various fields to predict, recognize, or identify targeted object(s) from a given input. The CNN architecture typically comprises multiple layers, including a combination of several convolutional, pooling (subsampling), and dropout (regularization) layers accompanied by fully connected layer(s) [116].
The stacking of several convolutional layers provides highly refined features at each layer level from the input layer to the output layer. Pooling or dropout layers are often interleaved among the convolutional layers. The purpose of a pooling layer (max-pooling or average-pooling) is to subsample or reduce the input representation. On the other hand, the dropout layer is used to reduce unnecessary dependencies of the network on particular features by dynamically lowering the number of neuron connections, which avoids overfitting and also reduces the training time. Thus, the neurons in each convolutional (feature extraction) layer are not fully connected. The final layers of a CNN are fully connected layers responsible for classification on the basis of the collective features retrieved from the previous layers [46], [139], [146].
LSTM is an advanced type of RNN that has feedback connections, allowing it to store previous states in the network memory. This state-awareness feature distinguishes LSTM from other NN models and supports the training process by incorporating previous state history when required. LSTM uses gradient descent with a time-based backpropagation training algorithm and also addresses the vanishing gradient problem, a limitation of RNNs, by allowing the gradients to pass unaltered [64], [139]. The LSTM architecture consists of memory blocks (LSTM units) and control gates (input, forget, and output gates) to control the information flow between LSTM units. LSTM-based models can monitor and retain long-standing states or dependencies, which is why they learn well from sequence inputs, building patterns that depend on context and previous states. The LSTM memory block keeps the relevant data from previous states to utilize in future calculations when required. The input gate determines the extent to which a new value flows into a cell, the forget gate determines the extent to which the value remains in the cell, and the output gate determines the extent to which the block value is used to calculate the output [37], [64], [139].
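A scalar sketch of one LSTM cell update showing the three gates (the weight layout and gate names are our own simplification, not a particular library's API):

```python
import math

def lstm_step(x, h_prev, c_prev, W):
    """One LSTM cell update (scalar input/state for clarity).

    W maps each gate name to its (input weight, recurrent weight, bias).
    """
    sig = lambda z: 1.0 / (1.0 + math.exp(-z))
    gate = lambda name, f: f(W[name][0] * x + W[name][1] * h_prev + W[name][2])
    i = gate("input", sig)        # how much new value enters the cell
    f = gate("forget", sig)       # how much of the old state is kept
    o = gate("output", sig)       # how much of the cell drives the output
    g = gate("cand", math.tanh)   # candidate cell value
    c = f * c_prev + i * g        # new cell state
    h = o * math.tanh(c)          # new hidden state
    return h, c
```

Note how the cell state c is an additive blend of the old state and the candidate value, which is what lets gradients pass through long sequences largely unaltered.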
ANFIS is a type of deep neural network: an ensemble of adaptive control, a fuzzy system, and artificial neural networks, with a fuzzy inference system (FIS) as the hidden layer [50], [147]. Fuzzy inference systems are widely used to understand the behavior of systems and are very easy to interpret [148]. These systems model human-like reasoning using plausible semantic terms. A specialist can develop the basic knowledge base using predefined rules in the inference system, whereas such a system cannot be built if no specialist is available [149]. Neural networks, on the other hand, have learning capabilities but cannot be interpreted [148]. Due to this neuro-fuzzy combination and adaptive control technique, ANFIS has the capabilities of both: the reasoning and interpretability of fuzzy logic and the learning of neural networks [147], [150].

e: CONDITIONAL RANDOM FIELD (CRF)
CRF models are statistical models deployed for various machine learning and pattern recognition based prediction problems, for example, classifying biological sequences [151] and identifying the functional activity of peptides [41]. The CRF is an undirected, discriminative graphical model that overcomes biases in labeling. It has the capability to predict a data label without considering the adjacent labels; however, it can work in a contextual manner by realizing the prediction dependencies in a graphical model. For details, see [41].

f: LOGISTIC REGRESSION (LR)
LR is another statistics-based classifier used to model the likelihood of some class, e.g., whether the input value belongs to a class under consideration or not [68]. LR is based on the logistic (sigmoid) function, which maps any real value into the range between 0 and 1. LR provides dualistic (binary) associations among dependent and independent variables; however, extensions for multi-label classification also exist [52], [152].
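The sigmoid mapping underlying LR, as a sketch (the weights here are purely illustrative, not a trained model):

```python
import math

def predict_proba(x, weights, bias):
    """Logistic regression: sigmoid of the linear combination of inputs."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))  # maps any real z into (0, 1)
```

A linear score of exactly zero lands on the 0.5 decision boundary; positive scores move the probability toward 1 and negative scores toward 0.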

g: NEAREST NEIGHBOR ALGORITHM (NNA)
NNA is a non-parametric and modest learning method used effectively for pattern recognition. It classifies an unidentified sample into a class on the basis of the weights of its specified ''k'' nearest neighbors, where k can be set to some optimum number of neighbors to consider for the weight-based membership calculation. The weight measure is usually a degree of distance between the unknown sample and the neighbors used to classify it [39].
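A minimal majority-vote sketch of the k-nearest-neighbor idea (unweighted voting over Euclidean distance; the toy coordinates stand in for feature vectors derived from any of the encodings above):

```python
import math
from collections import Counter

def knn_classify(sample, train, k=3):
    """Majority vote among the k training points nearest to the sample."""
    nearest = sorted(train, key=lambda item: math.dist(sample, item[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# (feature vector, label) pairs forming two well-separated toy clusters
train = [((0, 0), "neg"), ((0, 1), "neg"),
         ((5, 5), "pos"), ((5, 6), "pos"), ((6, 5), "pos")]
```

Distance-weighted voting, as described above, would replace the plain majority count with weights inversely related to each neighbor's distance.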

h: INCREMENT OF DIVERSITY WITH QUADRATIC DISCRIMINANT (IDQD) ANALYSIS
Quadratic discriminant (QD) analysis is a modified form of linear discriminant analysis (LDA) that evaluates a covariance model for each observation class. QDA is especially beneficial in situations where there is prior knowledge that each class shows quite distinctive behavior [153]. IDQD has two fragments: the first, increment of diversity (ID), is used to extract diverse information from the sample sequences, as the class of a sequence can be predicted with such information; the second fragment, QD, provides a scheme for assimilating the various extracted features into a system for prediction [20].

i: NAIVE BAYES (NB)
NB is normally recognized as one of the simplest classifiers effectively used in the field of bio-informatics [52] and is based on Bayes' theorem [154]. Bayes' theorem supports finding the occurrence probability of one variable on the basis of another, given in advance. NB adopts strong assumptions about the independence of the features, with each feature contributing its part equally and independently to obtain optimum results, which reduces the complexity of developing the classifier [52].

j: DECISION TREE (DT)
DT is a classifier effectively used in numerous fields to solve prediction, classification, or recognition problems. Conceivably the supreme property of a DT is that it provides easily interpretable results by breaking down complex problems into simpler ones. It works well with both classification and regression related problems. The DT has various variants and implementations; for example, Gordon et al. introduced ''CART'' (classification and regression tree) for analysis, having the procedural capabilities of both classification and regression [155]. A greedy algorithm is used during learning to decide splits in a tree at an optimum level, and the Gini index is used to measure the degree of inequality.
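The Gini index used by CART-style splits is a one-liner (sketch; the label strings are hypothetical):

```python
def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))
```

A pure node scores 0, while an even two-class mix scores 0.5; the greedy learner prefers the split that most reduces this impurity.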
''PART'', a partial decision tree, is an algorithm that provides appropriate decision rules without considering global optimization [156]. ''RPART'' (recursive partitioning) is an R-package-specific implementation of ''CART'' [157]. Another ensemble classifier, named ''Deep-Boost'', uses decision trees as members and has achieved better accuracy by handling the data over-fitting problem [158]. Zahiri et al. used a combination of classifiers, ''PART'', ''RPART'', and ''Deep-Boost'', for their prediction model.

5) ASSESSMENT OF QUESTION 5. WHAT ARE THE KEY PUBLICATION SOURCES IN THE SELECTED DOMAIN?
The motivation behind this RQ is to discover the key publication sources from which domain-related articles can be acquired, and eventually to identify excellent channels for publishing articles, supporting the research community with a predetermined source set. Table 11 presents a summarized description of the target publication sources and the articles published in each source, along with the (JCR/CORE) rank to show the quality of each selected article.
Of the total forty-one selected articles, thirty-eight (93%) were published in journals and only three (7%) were presented at conferences. Moreover, of the forty-one articles, 76% (31 articles) were published in highly recognized journals with a Q1 JCR rank, 15% (6 articles) appeared in sources with a Q2 JCR rank, only 2% (1 article) in Q3, and all the conference articles, 7% (3 articles), have a B4 CORE rank, as shown in Figure 6. It can be seen from the facts presented in Table 11 and Figure 6 that articles in the selected field were normally published in well-reputed publication sources.

IV. DISCUSSIONS
This section provides a detailed discussion of the different ML based predictors applied in different areas of the selected domain. First, a feature encoding taxonomy is proposed to sum up the results of this study; it is depicted in Figure 7. In addition, a standard framework is proposed to support the development of prediction models for the identification of peptides, shown in Figure 8.

A. TAXONOMY OF FEATURE ENCODINGS
The feature encodings identified in this study are classified in a hierarchical manner and presented here in taxonomical form. Initially, the feature encodings are classified into three major modes (structure based, property based, and sequence based) centered on the way they are formulated, as recognized from the selected articles. The hierarchy is developed on the basis of parent-child relationships. All these encoding approaches have already been discussed extensively in Section III, and their hierarchical classification is presented here in Figure 7.

B. FRAMEWORK FOR PEPTIDE PREDICTION MODEL
Proficient manual examination could serve to assess the likelihood of candidate peptides in the therapy of several diseases. However, computational prediction is better than manual inspection for several reasons. It has been identified computationally that peptide sequences with certain therapeutic activities share very low similarity with other peptide sequences that have the same functional properties; consequently, it is quite difficult for a human expert to identify them manually. Furthermore, the high performance of computational prediction models permits the exclusive discovery and selection of unforeseen protein/peptide precursors. This effectively supports experimentally validating the discovered precursors in labs before the design of drugs [159].
A generic framework to build a prediction model is proposed and depicted in Figure 8. The framework comprises five stages: dataset collection, dataset preprocessing, feature (encoding) selection and optimization, selection of an ML classifier (MLC), and implementation of the finalized model as a publicly accessible web-server or tool for experimental purposes. Dataset collection is the primary but most cumbersome task in building a successful predictive model, as no automated procedure is available for collecting the related datasets. Researchers gather data by examining extensively published articles, then inspect and curate the dataset for use in a model; some groups also convert these collections into publicly accessible databases so that other researchers can reveal new ways. The databases identified during this review are listed in Table 7 in Section III, and the numbers of peptide sequences used in the selected studies are given in Table 12 along with their references.
To improve the credibility of a prediction model, the collected dataset should have low sequence identity (homology); high redundancy may lead to model bias and eventually affect its performance, typically measured as accuracy. Consequently, the next step is preprocessing the collected dataset to clean it of redundant and identical sequences. The accuracy of the prediction model depends on various factors, including the availability of experimentally validated datasets and the preprocessing and elimination of redundant data. ''CD-HIT (Cluster Database at High Identity with Tolerance)'' is a freely available online suite widely used in the domain to reduce data redundancy. It takes amino acid sequences (proteins/peptides) or nucleotide sequences (DNA/RNA) in FASTA format as input and produces non-redundant sequence sets (datasets) and clusters of sequences as output at a provided identity threshold. Only sequences with similarity higher than the provided threshold are removed, reducing the dataset size without losing useful sequence information [160], [161]. Table 12 further presents paper-wise dataset information: the breakup of each benchmark dataset into training and independent/test sets (including the numbers of positive and negative samples), the programming languages or tools used, the framework/package/library used, and the sequence identity tool with the threshold value used to remove duplication. As Table 12 shows, the sequence identity threshold values used range from a minimum of 40% (0.4) to a maximum of 90% (0.9) across the articles, according to the size of the available datasets. A lower sequence identity threshold may increase the credibility of the constructed model, but a smaller dataset size restricts the application of such a threshold, specifically in the case of peptide prediction [61].
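The thresholding idea behind this redundancy-removal step can be illustrated with a deliberately simplified sketch. CD-HIT itself uses short-word counting and banded alignment for speed; the `naive_identity` measure and greedy loop below are hypothetical illustrations of the principle only, not CD-HIT's algorithm.

```python
def naive_identity(a: str, b: str) -> float:
    """Crude position-wise identity over the longer sequence length.
    (Illustration only; CD-HIT computes identity via fast word
    counting and alignment, not this naive comparison.)"""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))

def greedy_nonredundant(seqs, threshold=0.9):
    """Greedy clustering in the spirit of CD-HIT: keep a sequence only
    if it stays below `threshold` identity to every representative
    retained so far (longest sequences are processed first)."""
    reps = []
    for s in sorted(seqs, key=len, reverse=True):
        if all(naive_identity(s, r) < threshold for r in reps):
            reps.append(s)
    return reps

# An exact duplicate is removed; a near-identical sequence (8/9
# matching residues = 0.89 identity) survives a 0.9 threshold.
peptides = ["ACDEFGHIK", "ACDEFGHIR", "ACDEFGHIK", "WWWWYYYY"]
nonredundant = greedy_nonredundant(peptides, threshold=0.9)
print(nonredundant)
```

In practice one would run the CD-HIT suite itself at the chosen identity threshold (0.4 to 0.9 in the reviewed studies, per Table 12) rather than a hand-rolled filter.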
After the cleanup, the dataset should be divided into a training set and an independent/test set. No standard criterion has yet been defined for this division; however, according to the identified best practices, the distribution ratio may vary from 70:30 to 80:20 for the training and independent sets respectively, depending on the size of the whole dataset. The training set is used to train the ML classifier, while the independent set is used to evaluate the model's transparency and actual performance on unseen data.
The next stage in the development of a classifier is feature selection and optimization, which is a threefold process. As a first step, one or more feature encodings are selected to convert the diverse dataset (amino acid sequences) into fixed-length feature vectors understandable by ML classifiers. The principal feature encodings identified during this study have been discussed extensively in Section III, and their usage percentages are presented in Figure 9.
The most frequently used feature in this domain is PCP, with a 21% usage share; AAC is second with 20%, DC third with 14%, and PseAAC and SSE have usage shares of 7% and 6% respectively, among the selected studies. Most published articles utilized various combinations of feature encodings to optimize prediction credibility. However, the articles referenced as [46], [57], and [63] utilized single feature encodings, namely k-mer, CGR, and PRM respectively. The trade-off in selecting local versus global feature encodings is still an open challenge that needs attention.
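Two of the most common sequence-based encodings mentioned above, AAC and DC, are simple frequency counts and can be sketched as follows. This is a minimal illustration; real predictors typically combine several such vectors.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def aac(seq: str) -> list:
    """Amino acid composition: 20-dim vector of residue frequencies."""
    n = len(seq)
    return [seq.count(a) / n for a in AMINO_ACIDS]

def dc(seq: str) -> list:
    """Dipeptide composition: 400-dim vector of dipeptide frequencies."""
    pairs = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
    total = len(seq) - 1
    counts = {}
    for i in range(total):
        counts[seq[i:i + 2]] = counts.get(seq[i:i + 2], 0) + 1
    return [counts.get(p, 0) / total for p in pairs]

# Any peptide, regardless of length, maps to fixed-length vectors.
vec = aac("ACACK")
print(len(vec), len(dc("ACACK")))  # 20 400
```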
The combinatorial use of diverse encodings is likely to introduce noise and redundancy into the feature attributes, which may affect model training and cause overfitting; such trained models later perform poorly on the independent or unseen dataset. Subsequently, feature optimization methods are applied to finalize the effective features for model training. Various optimized feature selection techniques have been practiced using different feature ranking methodologies. First, the input features are ranked using the ''F-score algorithm'' [61], ''minimum redundancy maximum relevance (mRMR)'' [3], [16], or simply RF-based ranking, and then the optimized feature vectors with the best ''area under the curve (AUC)'' are determined using the ''sequential forward search (SFS)'' method. An RF model with Gini importance or ''mean decrease of Gini index (MDGI)''-based ''recursive feature elimination (RFE)'' was adopted for selecting highly ranked features in several studies [12], [45], [49], [51], [56]. Both RFE and SFS have been practiced effectively; however, RFE has performance benefits over SFS due to its less complex computational mechanism.
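The F-score ranking step can be sketched for a single feature as follows, using the commonly cited formulation: the squared separation of class means divided by the pooled within-class variance. The toy feature values are hypothetical, not data from the reviewed studies.

```python
def f_score(pos, neg):
    """F-score of one feature over positive and negative samples;
    a higher score means the feature separates the classes better."""
    def mean(xs):
        return sum(xs) / len(xs)
    mp, mn = mean(pos), mean(neg)
    m = mean(pos + neg)
    num = (mp - m) ** 2 + (mn - m) ** 2            # between-class spread
    den = (sum((x - mp) ** 2 for x in pos) / (len(pos) - 1)
           + sum((x - mn) ** 2 for x in neg) / (len(neg) - 1))
    return num / den if den else 0.0

# Feature A separates the classes cleanly; feature B does not.
pos_a, neg_a = [0.9, 0.8, 1.0], [0.1, 0.2, 0.0]
pos_b, neg_b = [0.5, 0.4, 0.6], [0.5, 0.6, 0.4]
print(f_score(pos_a, neg_a) > f_score(pos_b, neg_b))  # True
```

Ranking all features by this score and then growing the feature set greedily while the AUC improves is, in essence, the F-score + SFS pipeline described above.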
The next and most important phase is the selection of the most appropriate ML-based classification or regression model(s) for the problem under consideration. The main goal of carefully training an ML-based algorithm is that it should exhibit the same predictive behavior when provided with an unknown dataset and classify it accurately. ML-based classifiers are trained by providing the optimized features together with the known expected answers (labeled data), so that the classifier learns associations between the features and the expected answers. The most frequently used classifiers identified in the selected studies have been discussed in detail in Section III, and the usage share of each classifier is shown in Figure 10. Among the selected articles, 44% of the studies utilized SVM; the second most frequent classifier is RF with 25% usage, DNN has a 7% share, ANN and NNA have equal shares of 6% each, and LR has a usage share of 4%. Generally, an ML classifier has customary hyperparameters that require regulation and tuning to obtain the best configuration and thereby the best prediction performance [59]. ML classifiers learn from the datasets, and it is not feasible to use the same dataset for various purposes such as parameter tuning, learning, and final validation. Normally, cross-validation (CV) methods are used for the training, validation, and evaluation of a classifier's success. The main CV techniques, namely sub-sampling, also known as the ''k-fold test'' [12], [13], [44], the ''leave-one-out''/''jackknife'' test [20], and the independent test [45], [58], are often practiced in the training, tuning, and validation of classifiers. The use of these validation techniques is also represented in Table 12.
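The ''k-fold test'' mentioned above can be sketched as follows; setting k equal to the number of samples yields the leave-one-out/jackknife test. Function and variable names are illustrative.

```python
import random

def k_fold_indices(n, k=5, seed=1):
    """Split sample indices into k folds; each fold serves once as
    the validation set while the remaining folds train the classifier."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, val

splits = list(k_fold_indices(10, k=5))
print([len(v) for _, v in splits])  # [2, 2, 2, 2, 2]
```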
Numerous researchers practiced a single best-fitted classifier for their prediction model, though they compared their results with other classifiers too. It is advised here to train and test different ML-based classifiers and to decide on the final model, whether a single classifier or a combination of multiple classifiers (classifier fusion), on the basis of performance. Some hybrid or customized uses of several classifiers were also identified. For instance, D. Veltri et al. proposed a deep neural network model employing embedding, convolutional, and max-pooling layers for sequence pattern generalization and LSTM layers to characterize the given sequences [37]. Derived discrete features are fed to the embedding layer and converted into fixed-size representations in which semantically similar inputs lie near each other in the feature map. These features were convolved through a one-dimensional convolutional layer with sixty-four filters, and a max-pooling layer down-sampled the filter outputs of the convolutional layer. The downsized features were passed through an LSTM network to a dense layer with a sigmoid function to predict outputs in the 0-1 range; the reported accuracy of the model is 91.01%. Under the name PTPD, a new deep learning based computational model was proposed to predict therapeutic peptides. In PTPD, the first input layer maps the peptide sequences into k-mer based numeric vectors by incorporating bag-of-words and continuous skip-gram (two learning algorithms) through the word2vec tool. These numeric representations (vectors) were fed to a convolutional layer implemented with three types of one-dimensional convolution filters of different sizes and the ReLU activation function, which perform convolution over the whole vectors to create feature maps. A dense layer with a ReLU activation function was adopted to combine the results of the three convolutional filters after dimensionality reduction of the feature maps through a max-pooling layer, with a dropout function applied to avoid overfitting.
Finally, a fully connected layer with a sigmoid function was adopted for the output. According to Chuanyan et al., PTPD performs effectively, with 90.2% accuracy across several datasets. Torrent et al. developed an AMP prediction model using a two-layer feedforward NN with a sigmoid activation function in a hidden layer of 50 neurons and a scaled conjugate gradient (SCG) based backpropagation algorithm to adjust the weights during learning and minimize the error loss [42], and achieved 90% accuracy.
The classifier fusion approach was adopted by Zhang et al. [52] for AAP and by Akbar et al. [70] for ACP classification models, achieving accuracies of 83.2% and 96.45% respectively. Classifier fusion is the assembly of several individual basic classification models with different learning practices, combining the results from all autonomous classification models to address the same classification problem. The outcomes, in terms of sensitivity and specificity, of different individual models are usually different for a particular task. However, fusing the classifiers that are best in sensitivity and specificity usually produces a more suitable classification performer than any individual classifier [52], [162], [163], which has also been proved theoretically by Hansen and Salamon [164]. Zhang et al. [52] assessed the performance of various classification models, including RBFN, NB, LR, NNA, and RF. The final result was then determined on the basis of the average probability of the outcomes from the individual classifier best in specificity (best at predicting negative samples) and the one best in sensitivity (best at predicting positive samples). Similarly, Akbar et al. [70] utilized SVM, RF, PNN, GRNN, and NNA as operative classifiers, and the optimal results were obtained by fusing the outputs of the individual classifiers with an evolutionary genetic algorithm and majority voting.
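The two fusion rules described above, majority voting and average probability, each reduce to a few lines. The following is a schematic illustration with hypothetical base-classifier outputs, not the authors' actual code.

```python
from collections import Counter

def majority_vote(predictions):
    """Fuse the labels predicted by several base classifiers for one
    sample by simple majority voting."""
    return Counter(predictions).most_common(1)[0][0]

def average_probability(probs):
    """Fuse per-classifier positive-class probabilities by averaging;
    a fused score >= 0.5 is then read as a positive prediction."""
    return sum(probs) / len(probs)

# Three hypothetical base classifiers disagree; the fused label wins 2:1.
print(majority_vote(["ACP", "ACP", "non-ACP"]))                 # ACP
print(average_probability([0.9, 0.6, 0.7]) >= 0.5)              # True
```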
In ML, prediction quality is measured using various parameters. One of them is the confusion matrix, which provides a measure of the accuracy of the identified values. The accuracy measure alone is not sufficient for evaluating the performance of a predictor. Therefore, various parameters are considered: sensitivity, the percentage of truly identified positive samples; specificity, the percentage of truly identified negative samples; accuracy, the measure of correctly identified samples (positive + negative); and a generalized measure termed the ''Matthews correlation coefficient (MCC)'', which produces a quantitative value between +1 and −1 by considering the proportions of true positives, false positives, true negatives, and false negatives over the whole set. A value near or equal to +1 is considered an accurate estimation, 0 indicates poor prediction, and −1 indicates complete divergence between the predicted and known annotations. Refer to [15], [41], [52], [67] for a detailed overview and the mathematical calculations. MCC can also be calculated directly from the confusion matrix [165].
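All four measures can be computed directly from confusion-matrix counts; the counts in the example below are hypothetical.

```python
import math

def metrics(tp, fp, tn, fn):
    """Standard evaluation measures from confusion-matrix counts."""
    sens = tp / (tp + fn)                   # sensitivity (true positive rate)
    spec = tn / (tn + fp)                   # specificity (true negative rate)
    acc = (tp + tn) / (tp + fp + tn + fn)   # overall accuracy
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / den if den else 0.0
    return sens, spec, acc, mcc

# Hypothetical predictor output: 100 test samples.
sens, spec, acc, mcc = metrics(tp=45, fp=5, tn=40, fn=10)
print(round(acc, 2), round(mcc, 2))
```

Note how MCC (about 0.70 here) is noticeably lower than accuracy (0.85): it penalizes both kinds of error jointly, which is why it is preferred as the generalized measure.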
The receiver operating characteristic (ROC) curve has been widely adopted to measure classifier performance and for comparison purposes. The ROC plots the performance of a classifier as the relationship between the truly identified positives (sensitivity) and negatives (specificity) at various threshold levels, and the quantitative comparison and ranking of the ROCs of various classifiers is assessed using the ''area under the curve (AUC)'' [3], [12], [30].
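The AUC can equivalently be computed as the probability that a randomly chosen positive sample is scored above a randomly chosen negative one (the Mann-Whitney formulation), which avoids plotting the curve at all; the scores below are hypothetical.

```python
def auc(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs in which the
    positive sample receives the higher score (ties count as 0.5)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            wins += 1.0 if p > n else (0.5 if p == n else 0.0)
    return wins / (len(pos_scores) * len(neg_scores))

# One misranked pair (0.4 < 0.5) out of nine -> AUC = 8/9.
print(auc([0.9, 0.8, 0.4], [0.5, 0.3, 0.2]))
```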
The last step is to develop the actual classification model based on the selected classifier(s) and possibly deploy it on the web. A web application for the trained model is not obligatory; however, the research community must have some way to access the prediction model and the dataset, whether as downloadable tools, APIs, or otherwise. Such provision is significant for extending experiments in the domain.

V. ISSUES AND CHALLENGES
There are notable challenges in utilizing ML-based models for the therapeutic prediction of peptide sequences. ML classifiers demand accurately validated and quantitative datasets. The quantitative selection and dissemination of such datasets is a challenging task due to costly lab-based experiments and other associated complexities, leading to inadequate coverage of the entire sequence set of the domain.
To develop an ML classifier with the best-fitted model, several repetitive experiments are obligatory. Among the numerous available encoding schemes and classification models, the scholars of the field normally rely on their personal expertise to select encodings and classifiers. Because of this, different trained models used to predict the same peptide sequences may produce different outcomes. Difficult feature extraction procedures may also hinder the selection of unbiased and comprehensive features. Furthermore, a number of features exceeding the size of the actual dataset may impact the performance of a model [166].
A significant problem in implementing an ML model for a prospective therapeutic is the inadequate description of the data and the repeatability of the experimental results obtained with ML. Various already implemented prediction models do not provide their datasets, ML models, or model code to the research community, which is the main hindrance to designing new models for experimentation in the same field. It is therefore very important to provide the datasets and model code so that others can find new ways forward [167].
Data repetition is another challenge, which may lead to lower and biased predictor performance due to overfitting of the model. Ordinarily, when a predictor is trained on noisy or repetitive data, it absorbs irrelevant information while learning; subsequently, the predictor becomes over-fitted, performs inadequately on the unseen dataset, and produces unexpected results. The remedy is redundancy removal by setting a sequence identity threshold using CD-HIT. Normally, however, an insufficient dataset volume does not support applying the threshold at the recommended level, i.e., 30% [104]. The counterpart of overfitting is underfitting, which occurs when a model is unable to learn well and produces biased results on an independent dataset; this is usually caused by the limited size of the dataset. Underfitting and overfitting are prominent challenges in the absence of a testing or independent dataset.
Typically, to assess the performance of the created prediction model, researchers utilize their own independent dataset, which introduces bias into the assessment. For a neutral evaluation of the created model, as well as for comparing its performance with other models, the construction of a distinctive common independent dataset is quite sensible, and is suggested here. Another matter that introduces bias into the results of a classifier is dataset imbalance. In such a case, accuracy is not the only criterion; various metrics should instead be used to evaluate a model and avoid the effects of dataset imbalance, such as the confusion matrix, precision score, recall score, F-score, or ROC [168].
Many ML classifier models are available only for the prediction of a specific type of peptide. There is a dearth of common models able to predict or classify multiple types of peptides as a whole. However, Chunrui et al. developed a method to predict ACP, allergenic, and virulence prospects from peptides [43], and Wei et al. established a model for the prediction of eight types of therapy-related peptides [3]. This underlines the necessity of developing general and progressive prediction models proficient in predicting several types of peptides and their functional activities.

VI. CONCLUSION
In this article, a systematic literature review has been conducted that provides a comprehensive discussion based on qualitatively selected research papers in the field of computational prediction of therapeutic peptides. The study was carried out using a well-defined, systematic method for the selection of the forty-one published articles. An analysis of the various ML classifiers, feature extraction schemes, and data sources has been presented. Based on the best practices identified from the selected articles, a framework has been presented as a guideline for upcoming developments in the domain. Similarly, the feature encodings have been classified and presented in taxonomic order. Moreover, issues, challenges, and future prospects have been discussed, which may provide guidelines for researchers in the domain. Furthermore, researchers in the domain are advised to provide only their most reliable prediction models for equitable future comparisons, instead of sharing all experimental models as practiced previously. Although various peptide predictors are available in the field, the rapidly growing amount of data continues to emphasize the demand for new approaches and qualitative ML methods to explore information beneficial for therapeutics and drug design.