IEEE/ACM Transactions on Computational Biology and Bioinformatics

Issue 4 • July-Aug. 2012


Displaying Results 1 - 25 of 36
  • [Front cover]

    Page(s): c1
    Freely Available from IEEE
  • [Inside front cover]

    Page(s): c2
    Freely Available from IEEE
  • Guest Editorial: Bioinformatics and Computational Systems Biology

    Page(s): 945 - 946
    Freely Available from IEEE
  • A Sparse Regulatory Network of Copy-Number Driven Gene Expression Reveals Putative Breast Cancer Oncogenes

    Page(s): 947 - 954

    Copy number aberrations are recognized to be important in cancer as they may localize to regions harboring oncogenes or tumor suppressors. Such genomic alterations mediate phenotypic changes through their impact on expression. Both cis- and trans-acting alterations are important since they may help to elucidate putative cancer genes. However, amidst numerous passenger genes, trans-effects are less well studied due to the computational difficulty of detecting weak and sparse signals in the data, even though they may influence multiple genes on a global scale. We propose an integrative approach to learn a sparse interaction network of DNA copy-number regions with their downstream transcriptional targets in breast cancer. With respect to goodness of fit on both simulated and real data, the performance of sparse network inference is no worse than other state-of-the-art models, but with the advantage of simultaneous feature selection and efficiency. The DNA-RNA interaction network helps to distinguish copy-number driven expression alterations from those that are copy-number independent. Further, our approach yields a quantitative copy-number dependency score, which distinguishes cis- versus trans-effects. When applied to a breast cancer data set, numerous expression profiles were impacted by cis-acting copy-number alterations, including several known oncogenes such as GRB7, ERBB2, and LSM1. Several trans-acting alterations were also identified, impacting genes such as ADAM2 and BAGE, which warrant further investigation. Availability: An R package named lol is available from www.markowetzlab.org/software/lol.html.
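
    A minimal sketch of the core idea, assuming per-gene L1-penalized (lasso) regression of expression on copy-number regions; the data, dimensions, and penalty below are invented for illustration, and the authors' own implementation is the R package lol cited above, not this code.

    ```python
    # Toy illustration: learn a sparse copy-number -> expression network by
    # regressing each gene's expression on all copy-number regions with an L1
    # penalty, so only a few region-gene links get nonzero coefficients.
    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(0)
    n_samples, n_regions, n_genes = 100, 50, 20        # made-up dimensions
    cn = rng.normal(size=(n_samples, n_regions))       # copy-number matrix
    expr = rng.normal(size=(n_samples, n_genes))       # expression matrix

    network = np.zeros((n_regions, n_genes))
    for g in range(n_genes):
        network[:, g] = Lasso(alpha=0.1).fit(cn, expr[:, g]).coef_

    # Crude per-gene copy-number dependency score: variance explained by the
    # selected regions (high values suggest copy-number driven expression).
    dependency = 1 - np.var(expr - cn @ network, axis=0) / np.var(expr, axis=0)
    print(dependency.round(2))
    ```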

  • Inference of Biological S-System Using the Separable Estimation Method and the Genetic Algorithm

    Page(s): 955 - 965

    Reconstruction of a biological system from its experimental time series data is a challenging task in systems biology. The S-system, which consists of a group of nonlinear ordinary differential equations (ODEs), is an effective model to characterize molecular biological systems and analyze the system dynamics. However, inference of S-systems without knowledge of the system structure is not a trivial task due to its nonlinearity and complexity. In this paper, a pruning separable parameter estimation algorithm (PSPEA) is proposed for inferring S-systems. This novel algorithm combines the separable parameter estimation method (SPEM) and a pruning strategy, which includes adding an ℓ1 regularization term to the objective function and pruning the solution with a threshold value. This algorithm is then combined with the continuous genetic algorithm (CGA) to form a hybrid algorithm that inherits the strengths of both. The performance of the pruning strategy in the proposed algorithm is evaluated from two aspects: the parameter estimation error and the structure identification accuracy. The results show that the proposed algorithm with the pruning strategy has much lower estimation error and much higher identification accuracy than the existing method.
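
    For readers unfamiliar with the S-system form, the following sketch simulates a toy two-variable S-system, dX_i/dt = alpha_i * prod_j X_j^g_ij - beta_i * prod_j X_j^h_ij, in Python; all parameter values are arbitrary, and the paper's estimation machinery (PSPEA, SPEM, CGA) is not reproduced here.

    ```python
    # Simulate a small S-system to generate the kind of time series data from
    # which the parameters and structure would be inferred.
    import numpy as np
    from scipy.integrate import solve_ivp

    alpha = np.array([2.0, 1.0])          # production rate constants (toy)
    beta = np.array([1.0, 1.0])           # degradation rate constants (toy)
    G = np.array([[0.0, -0.8],            # production kinetic orders g_ij
                  [0.5,  0.0]])
    H = np.array([[0.5,  0.0],            # degradation kinetic orders h_ij
                  [0.0,  0.5]])

    def s_system(t, x):
        production = alpha * np.prod(x ** G, axis=1)
        degradation = beta * np.prod(x ** H, axis=1)
        return production - degradation

    sol = solve_ivp(s_system, (0.0, 20.0), [1.0, 0.5],
                    t_eval=np.linspace(0.0, 20.0, 50))
    print(sol.y.shape)                    # (2 state variables, 50 time points)
    ```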

  • Identifying Gene Pathways Associated with Cancer Characteristics via Sparse Statistical Methods

    Page(s): 966 - 972
    Multimedia

    We propose a statistical method for uncovering gene pathways that characterize cancer heterogeneity. To incorporate knowledge of the pathways into the model, we define a set of pathway activities from microarray gene expression data based on Sparse Probabilistic Principal Component Analysis (SPPCA). A pathway activity logistic regression model is then formulated for the cancer phenotype. To select pathway activities related to binary cancer phenotypes, we use the elastic net for parameter estimation and derive a model selection criterion for choosing the tuning parameters included in the model estimation. Our proposed method can also reverse-engineer gene networks based on the identified multiple pathways, which enables us to discover novel gene-gene associations related to the cancer phenotypes. We illustrate the whole process of the proposed method through the analysis of breast cancer gene expression data.
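
    A minimal sketch of the two-stage idea, assuming ordinary PCA as a stand-in for SPPCA and scikit-learn's elastic-net logistic regression; the pathway gene sets and data are invented, and the paper's model selection criterion for the tuning parameters is not implemented.

    ```python
    # Stage 1: summarize each pathway's genes into one activity score.
    # Stage 2: elastic-net logistic regression of the phenotype on activities;
    # pathways with nonzero coefficients are the selected ones.
    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(1)
    X = rng.normal(size=(80, 200))                    # samples x genes (toy)
    y = rng.integers(0, 2, size=80)                   # binary phenotype
    pathways = {"pathA": range(0, 50), "pathB": range(50, 120), "pathC": range(120, 200)}

    activities = np.column_stack([
        PCA(n_components=1).fit_transform(X[:, list(genes)]).ravel()
        for genes in pathways.values()
    ])

    clf = LogisticRegression(penalty="elasticnet", solver="saga",
                             l1_ratio=0.5, C=1.0, max_iter=5000).fit(activities, y)
    print(dict(zip(pathways, clf.coef_.ravel().round(3))))
    ```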

  • Stochastic Gene Expression Modeling with Hill Function for Switch-Like Gene Responses

    Page(s): 973 - 979

    Gene expression models play a key role in understanding the mechanisms of gene regulation, which exhibit both graded and switch-like responses. Though many stochastic approaches attempt to explain gene expression mechanisms, the Gillespie algorithm, which is commonly used to simulate stochastic models, requires an additional gene cascade to explain the switch-like behaviors of gene responses. In this study, we propose a stochastic gene expression model describing the switch-like behaviors of a gene by incorporating Hill functions into the conventional Gillespie algorithm. We assume eight gene expression processes, whose biologically appropriate reaction rates are estimated from the published literature. We observed that the state of the toggle switch model rarely changes, since the Hill function prevents the activation of the involved proteins when their concentrations stay below a threshold. In the ScbA-ScbR system, which can control the antibiotic metabolite production of microorganisms, our modified Gillespie algorithm successfully describes the switch-like behaviors of gene responses and the oscillatory expression that is consistent with the published experimental study.
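
    The modification described above can be illustrated with a minimal stochastic simulation: a Gillespie-style loop in which the transcription propensity is gated by a Hill function of a regulator protein. The reaction set, rate constants, and Hill parameters below are toy assumptions, not the eight-process model or the literature-derived rates used in the paper.

    ```python
    # Minimal Gillespie SSA with a Hill-gated transcription propensity.
    import numpy as np

    rng = np.random.default_rng(2)

    def hill(p, K=50.0, n=4):
        """Activation-form Hill function: switch-like response around K."""
        return p ** n / (K ** n + p ** n)

    k_tx, k_tl, d_m, d_p = 1.0, 2.0, 0.2, 0.05   # toy rate constants
    regulator = 80                                # fixed activator copy number (assumption)
    m, p = 0, 0                                   # mRNA and protein copy numbers
    t, t_end, trajectory = 0.0, 200.0, []

    while t < t_end:
        a = np.array([k_tx * hill(regulator),     # transcription (Hill-gated)
                      k_tl * m,                   # translation
                      d_m * m,                    # mRNA decay
                      d_p * p])                   # protein decay
        total = a.sum()
        t += rng.exponential(1.0 / total)
        reaction = rng.choice(4, p=a / total)
        if reaction == 0:
            m += 1
        elif reaction == 1:
            p += 1
        elif reaction == 2:
            m -= 1
        else:
            p -= 1
        trajectory.append((t, m, p))
    ```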

  • Exploiting the Functional and Taxonomic Structure of Genomic Data by Probabilistic Topic Modeling

    Page(s): 980 - 991

    In this paper, we present a method that enables both the homology-based and the composition-based approaches to further study the functional core (i.e., the microbial core and the gene core, respectively). In the proposed method, the identification of major functionality groups is achieved by generative topic modeling, which is able to extract useful information from unlabeled data. We first show that a generative topic model can be used to model the taxon abundance information obtained by the homology-based approach and to study the microbial core. The model considers each sample as a "document," which has a mixture of functional groups, while each functional group (also known as a "latent topic") is a weighted mixture of species. Therefore, estimating the generative topic model for taxon abundance data will uncover the distribution over latent functions (latent topics) in each sample. Second, we show that a generative topic model can also be used to study the genome-level composition of "N-mer" features (DNA subreads obtained by composition-based approaches). The model considers each genome as a mixture of latent genetic patterns (latent topics), while each pattern is a weighted mixture of the "N-mer" features; thus, the existence of core genomes can be indicated by a set of common N-mer features. After studying the mutual information between latent topics and gene regions, we provide an explanation of the functional roles of the uncovered latent genetic patterns. The experimental results demonstrate the effectiveness of the proposed method.
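
    Assuming standard latent Dirichlet allocation as the generative topic model, the "sample = document, taxon = word" analogy can be sketched as follows; the count matrix and the number of topics are invented for illustration.

    ```python
    # Fit LDA to a taxon-abundance count matrix: each latent topic is a weighted
    # mixture of taxa (a putative functional group), and each sample is a
    # mixture of topics.
    import numpy as np
    from sklearn.decomposition import LatentDirichletAllocation

    rng = np.random.default_rng(3)
    counts = rng.poisson(2.0, size=(30, 500))      # 30 samples x 500 taxa (toy)

    lda = LatentDirichletAllocation(n_components=5, random_state=0).fit(counts)
    sample_topics = lda.transform(counts)          # per-sample mixture over functional groups
    topic_taxa = lda.components_                   # per-topic weights over taxa
    top_taxa = topic_taxa.argsort(axis=1)[:, -10:] # ten most heavily weighted taxa per topic
    print(sample_topics[0].round(2), top_taxa[0])
    ```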

  • Predicting Ligand Binding Residues and Functional Sites Using Multipositional Correlations with Graph Theoretic Clustering and Kernel CCA

    Page(s): 992 - 1001

    We present a new computational method for predicting ligand binding residues and functional sites in protein sequences. These residues and sites tend not only to be conserved but also to exhibit strong correlation, due to the selection pressure during evolution to maintain the required structure and/or function. To explore the effect of correlations among multiple positions in the sequences, the method uses graph theoretic clustering and kernel-based canonical correlation analysis (kCCA) to identify binding and functional sites as the residues that exhibit strong correlation between the residues' evolutionary characterization at the sites and the structure-based functional classification of the proteins in the context of a functional family. The results of testing the method on two well-curated data sets show that the prediction accuracy, as measured by Receiver Operating Characteristic (ROC) scores, improves significantly when multipositional correlations are accounted for.

  • Guest Editors' Introduction to the Special Section on Bioinformatics Research and Applications

    Page(s): 1002 - 1003
    Freely Available from IEEE
  • Fast Local Search for Unrooted Robinson-Foulds Supertrees

    Page(s): 1004 - 1013

    A Robinson-Foulds (RF) supertree for a collection of input trees is a tree containing all the species in the input trees that is at minimum total RF distance to the input trees. Thus, an RF supertree is consistent with the maximum number of splits in the input trees. Constructing RF supertrees for rooted and unrooted data is NP-hard. Nevertheless, effective local search heuristics have been developed for the restricted case where the input trees and the supertree are rooted. We describe new heuristics, based on the Edge Contract and Refine (ECR) operation, that remove this restriction, thereby expanding the utility of RF supertrees. Our experimental results on simulated and empirical data sets show that our unrooted local search algorithms yield better supertrees than those obtained from MRP and rooted RF heuristics in terms of total RF distance to the input trees and, for simulated data, in terms of RF distance to the true tree.

  • A Metric for Phylogenetic Trees Based on Matching

    Page(s): 1014 - 1022

    Comparing two or more phylogenetic trees is a fundamental task in computational biology. The simplest outcome of such a comparison is a pairwise measure of similarity, dissimilarity, or distance. A large number of such measures have been proposed, but so far all suffer from problems varying from computational cost to lack of robustness; many can be shown to behave unexpectedly under certain plausible inputs. For instance, the widely used Robinson-Foulds distance is poorly distributed and thus affords little discrimination, while also lacking robustness in the face of very small changes: reattaching a single leaf elsewhere in a tree of any size can instantly maximize the distance. In this paper, we introduce a new pairwise distance measure, based on matching, for phylogenetic trees. We prove that our measure induces a metric on the space of trees, show how to compute it in low polynomial time, verify through statistical testing that it is robust, and finally note that it does not exhibit unexpected behavior under the same inputs that cause problems with other measures. We also illustrate its usefulness in clustering trees, demonstrating significant improvements in the quality of hierarchical clustering as compared to the same collections of trees clustered using the Robinson-Foulds distance.
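
    A rough sketch of a matching-style distance between two unrooted trees on the same leaf set, assuming splits (bipartitions induced by internal edges) as the matched objects, symmetric-difference size as the pairwise cost, and SciPy's assignment solver; the padding cost for unmatched splits is a guess, and the paper's exact definition may differ.

    ```python
    # Minimum-cost perfect matching between the splits of two trees.
    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def split_cost(a, b, leaves):
        a, b = frozenset(a), frozenset(b)
        return min(len(a ^ b), len(a ^ (leaves - b)))   # splits are unordered bipartitions

    def matching_distance(splits1, splits2, leaves):
        n = max(len(splits1), len(splits2))
        pad = len(leaves) // 2                           # cost of an unmatched split (assumption)
        cost = np.full((n, n), float(pad))
        for i, s1 in enumerate(splits1):
            for j, s2 in enumerate(splits2):
                cost[i, j] = split_cost(s1, s2, leaves)
        rows, cols = linear_sum_assignment(cost)
        return cost[rows, cols].sum()

    leaves = frozenset("ABCDEF")
    t1 = [{"A", "B"}, {"A", "B", "C"}, {"E", "F"}]       # nontrivial splits of tree 1
    t2 = [{"A", "C"}, {"A", "B", "C"}, {"D", "E"}]       # nontrivial splits of tree 2
    print(matching_distance(t1, t2, leaves))
    ```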

  • The Kernel of Maximum Agreement Subtrees

    Page(s): 1023 - 1031

    A Maximum Agreement SubTree (MAST) is a largest subtree common to a set of trees and serves as a summary of common substructure in the trees. A single MAST can be misleading, however, since there can be an exponential number of MASTs, and two MASTs for the same tree set do not even necessarily share any leaves. In this paper, we introduce the notion of the Kernel Agreement SubTree (KAST), which is the summary of the common substructure in all MASTs, and show that it can be calculated in polynomial time (for trees with bounded degree). Suppose the input trees represent competing hypotheses for a particular phylogeny. We explore the utility of the KAST as a method to discern the common structure of confidence, and as a measure of how confident we are in a given tree set. We also show the trend of the KAST, as compared to other consensus methods, on the set of all trees visited during a Bayesian analysis of flatworm genomes.

  • Refining Regulatory Networks through Phylogenetic Transfer of Information

    Page(s): 1032 - 1045

    The experimental determination of transcriptional regulatory networks in the laboratory remains difficult and time-consuming, while computational methods to infer these networks provide only modest accuracy. The latter can be attributed partly to the limitations of a single-organism approach. Computational biology has long used comparative and evolutionary approaches to extend the reach and accuracy of its analyses. In this paper, we describe ProPhyC, a probabilistic phylogenetic model and associated inference algorithms, designed to improve the inference of regulatory networks for a family of organisms by using known evolutionary relationships among these organisms. ProPhyC can be used with various network evolutionary models and any existing inference method. Extensive experimental results on both biological and synthetic data confirm that our model (through its associated refinement algorithms) yields substantial improvement in the quality of inferred networks over all current methods. We also compare ProPhyC with a transfer learning approach we design. This approach also uses phylogenetic relationships while inferring regulatory networks for a family of organisms. Using similar input information but designed in a very different framework, this transfer learning approach does not perform better than ProPhyC, which indicates that ProPhyC makes good use of the evolutionary information.

  • Algorithms to Detect Multiprotein Modularity Conserved during Evolution

    Page(s): 1046 - 1058

    Detecting essential multiprotein modules that change infrequently during evolution is a challenging algorithmic task that is important for understanding the structure, function, and evolution of the biological cell. In this paper, we define a measure of modularity for interactomes and present a linear-time algorithm, Produles, for detecting multiprotein modularity conserved during evolution that improves on the running time of previous algorithms for related problems and offers desirable theoretical guarantees. We present a biologically motivated graph theoretic set of evaluation measures complementary to previous evaluation measures, demonstrate that Produles exhibits good performance by all measures, and describe certain recurrent anomalies in the performance of previous algorithms that are not detected by previous measures. Consideration of the newly defined measures and algorithm performance on these measures leads to useful insights on the nature of interactomics data and the goals of previous and current algorithms. Through randomization experiments, we demonstrate that conserved modularity is a defining characteristic of interactomes. Computational experiments on current experimentally derived interactomes for Homo sapiens and Drosophila melanogaster, combining results across algorithms, show that nearly 10 percent of current interactome proteins participate in multiprotein modules with good evidence in the protein interaction data of being conserved between human and Drosophila.

  • Predicting Protein Function by Multi-Label Correlated Semi-Supervised Learning

    Page(s): 1059 - 1069

    Assigning biological functions to uncharacterized proteins is a fundamental problem in the postgenomic era. The increasing availability of large amounts of data on protein-protein interactions (PPIs) has led to the emergence of a considerable number of computational methods for determining protein function in the context of a network. These algorithms, however, treat each functional class in isolation and thereby often suffer from the scarcity of labeled data. In reality, different functional classes are naturally dependent on one another. We propose a new algorithm, Multi-label Correlated Semi-supervised Learning (MCSL), to incorporate the intrinsic correlations among functional classes into protein function prediction by leveraging the relationships provided by the PPI network and the functional class network. The guiding intuition is that the classification function should be sufficiently smooth on subgraphs where the respective topologies of these two networks are a good match. We encode this intuition as regularized learning with intraclass and interclass consistency, which can be understood as an extension of the graph-based learning with local and global consistency (LGC) method. Cross validation on the yeast proteome illustrates that MCSL consistently outperforms several state-of-the-art methods. Most notably, it effectively overcomes the problem associated with the scarcity of labeled data. The supplementary files are freely available at http://sites.google.com/site/csaijiang/MCSL.
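
    The single-network LGC method that MCSL extends has a simple closed form, F = (1 - alpha) (I - alpha S)^(-1) Y, with S the symmetrically normalized adjacency matrix; the sketch below applies it to a toy PPI graph with two functional classes. The multi-label, two-network regularization of MCSL itself is not reproduced here.

    ```python
    # Graph-based learning with local and global consistency (LGC) on a toy graph.
    import numpy as np

    def lgc(adj, labels, alpha=0.9):
        """adj: (n, n) weighted adjacency; labels: (n, k), zero rows for unlabeled nodes."""
        d = adj.sum(axis=1)
        d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
        S = d_inv_sqrt @ adj @ d_inv_sqrt
        n = adj.shape[0]
        return (1 - alpha) * np.linalg.solve(np.eye(n) - alpha * S, labels)

    # Two triangles joined by a single edge; one labeled protein per class.
    adj = np.array([[0, 1, 1, 0, 0, 0],
                    [1, 0, 1, 0, 0, 0],
                    [1, 1, 0, 1, 0, 0],
                    [0, 0, 1, 0, 1, 1],
                    [0, 0, 0, 1, 0, 1],
                    [0, 0, 0, 1, 1, 0]], dtype=float)
    Y = np.zeros((6, 2))
    Y[0, 0] = 1.0                      # protein 0 annotated with class 0
    Y[5, 1] = 1.0                      # protein 5 annotated with class 1
    print(lgc(adj, Y).argmax(axis=1))  # predicted class per protein
    ```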

  • Identification of Essential Proteins Based on Edge Clustering Coefficient

    Page(s): 1070 - 1080

    Identification of essential proteins is key to understanding the minimal requirements for cellular life and important for drug design. The rapid increase of available protein-protein interaction (PPI) data has made it possible to detect protein essentiality at the network level. A series of centrality measures have been proposed to discover essential proteins based on network topology. However, most of them tend to focus only on the location of a single protein and ignore the relevance between interactions and protein essentiality. In this paper, a new centrality measure for identifying essential proteins based on the edge clustering coefficient, named NC, is proposed. Different from previous centrality measures, NC considers both the centrality of a node and the relationship between it and its neighbors. For each interaction in the network, we calculate its edge clustering coefficient. A node's essentiality is determined by the sum of the edge clustering coefficients of the interactions connecting it and its neighbors. The new centrality measure NC takes into account the modular nature of protein essentiality. NC is applied to three different types of yeast protein-protein interaction networks, obtained from the DIP, MIPS, and BioGRID databases, respectively. The experimental results on the three networks show that the number of essential proteins discovered by NC universally exceeds that discovered by the six other centrality measures: DC, BC, CC, SC, EC, and IC. Moreover, the essential proteins discovered by NC show a significant clustering effect.
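
    A small sketch of the NC idea, assuming a common definition of the edge clustering coefficient (shared neighbors of an edge divided by the maximum possible number); the toy network is invented and the paper's exact formula may differ.

    ```python
    # Rank proteins by the sum of edge clustering coefficients over incident edges.
    import networkx as nx

    def edge_clustering_coefficient(g, u, v):
        common = len(list(nx.common_neighbors(g, u, v)))
        denom = min(g.degree(u) - 1, g.degree(v) - 1)
        return common / denom if denom > 0 else 0.0

    def nc_scores(g):
        return {v: sum(edge_clustering_coefficient(g, v, u) for u in g[v]) for v in g}

    # Toy PPI network: a dense clique attached to a sparse chain.
    g = nx.Graph([("A", "B"), ("A", "C"), ("B", "C"), ("A", "D"),
                  ("B", "D"), ("C", "D"), ("D", "E"), ("E", "F")])
    ranking = sorted(nc_scores(g).items(), key=lambda kv: kv[1], reverse=True)
    print(ranking)    # clique members rank above the chain proteins
    ```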

  • A Computational Model for Predicting Protein Interactions Based on Multidomain Collaboration

    Page(s): 1081 - 1090
    Multimedia

    Recently, several domain-based computational models for predicting protein-protein interactions (PPIs) have been proposed. The conventional methods usually infer domain or domain combination (DC) interactions from already known interacting sets of proteins, and then predict PPIs using the information. However, the majority of these models often have limitations in providing detailed information on which domain pair (single domain interaction) or DC pair (multidomain interaction) will actually interact for the predicted protein interaction. Therefore, a more comprehensive and concrete computational model for the prediction of PPIs is needed. We developed a computational model to predict PPIs using the information of intraprotein domain cohesion and interprotein DC coupling interaction. A method of identifying the primary interacting DC pair was also incorporated into the model in order to infer actual participants in a predicted interaction. Our method made an apparent improvement in the PPI prediction accuracy, and the primary interacting DC pair identification was valid specifically in predicting multidomain protein interactions. In this paper, we demonstrate that 1) the intraprotein domain cohesion is meaningful in improving the accuracy of domain-based PPI prediction, 2) a prediction model incorporating the intradomain cohesion enables us to identify the primary interacting DC pair, and 3) a hybrid approach using the intra/interdomain interaction information can lead to a more accurate prediction.

  • A Hybrid Approach to Survival Model Building Using Integration of Clinical and Molecular Information in Censored Data

    Page(s): 1091 - 1105

    In the medical community, prognostic models that use clinicopathologic features to predict prognosis after a certain treatment have been externally validated and used in practice. In recent years, most research has focused on high-dimensional genomic data with small sample sizes. Since clinically similar but molecularly heterogeneous tumors may produce different clinical outcomes, combining clinical and genomic information, which may be complementary, is crucial to improving the quality of prognostic predictions. However, there is a lack of an integration scheme for clinico-genomic models due to the P ≫ N problem, in particular for a parsimonious model. We propose a methodology to build a reduced yet accurate integrative model using a hybrid approach based on the Cox regression model, which uses several dimension reduction techniques, L2-penalized maximum likelihood estimation (PMLE), and resampling methods to tackle the problem. The predictive accuracy of the modeling approach is assessed by several metrics via an independent and thorough scheme to compare competing methods. In breast cancer data studies on a metastasis and death event, we show that the proposed methodology can improve prediction accuracy and build a final model with a hybrid signature that is parsimonious when integrating both types of variables.
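
    One ingredient of the pipeline described above, a penalized Cox model over mixed clinical and gene-signature covariates, can be sketched with the lifelines package; the data frame, covariate names, signature scores, and penalty value are all invented for illustration, and the dimension reduction and resampling steps are omitted.

    ```python
    # L2-penalized Cox regression on a toy clinico-genomic data set.
    import numpy as np
    import pandas as pd
    from lifelines import CoxPHFitter

    rng = np.random.default_rng(5)
    n = 120
    df = pd.DataFrame({
        "age": rng.normal(55, 10, n),
        "tumor_size": rng.normal(2.5, 1.0, n),
        "gene_score_1": rng.normal(size=n),   # e.g., a summary score of an expression signature
        "gene_score_2": rng.normal(size=n),
        "time": rng.exponential(60, n),       # follow-up time (months)
        "event": rng.integers(0, 2, n),       # 1 = metastasis/death observed, 0 = censored
    })

    cph = CoxPHFitter(penalizer=0.1)          # ridge-type shrinkage, useful when P >> N
    cph.fit(df, duration_col="time", event_col="event")
    print(cph.summary[["coef", "exp(coef)", "p"]])
    ```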

  • A Survey on Filter Techniques for Feature Selection in Gene Expression Microarray Analysis

    Page(s): 1106 - 1119

    A plenitude of feature selection (FS) methods is available in the literature, most of them arising from the need to analyze data of very high dimension, usually hundreds or thousands of variables. Such data sets are now available in various application areas like combinatorial chemistry, text mining, multivariate imaging, or bioinformatics. As a generally accepted rule, these methods are grouped into filters, wrappers, and embedded methods. More recently, a new group of methods has been added to the general framework of FS: ensemble techniques. The focus of this survey is on filter feature selection methods for informative feature discovery in gene expression microarray (GEM) analysis, which is also known as differentially expressed gene (DEG) discovery, gene prioritization, or biomarker discovery. We present them in a unified framework, using standardized notations in order to reveal their technical details and to highlight their common characteristics as well as their particularities.
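
    As a concrete instance of the filter paradigm the survey covers, the sketch below scores every gene with a univariate two-sample t-test, independently of any classifier, and keeps the top-ranked genes; the data and cutoff are synthetic.

    ```python
    # Univariate filter feature selection for differentially expressed genes.
    import numpy as np
    from scipy.stats import ttest_ind

    rng = np.random.default_rng(4)
    X = rng.normal(size=(60, 1000))            # 60 samples x 1000 genes (toy)
    y = np.array([0] * 30 + [1] * 30)          # two phenotype groups
    X[y == 1, :20] += 1.5                      # spike in 20 truly differential genes

    t_stat, p_val = ttest_ind(X[y == 0], X[y == 1], axis=0)
    ranking = np.argsort(p_val)                # most significant genes first
    selected = ranking[:20]
    print(sorted(selected))                    # mostly recovers the spiked genes
    ```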

  • BpMatch: An Efficient Algorithm for a Segmental Analysis of Genomic Sequences

    Page(s): 1120 - 1127
    Multimedia

    Here, we propose BpMatch: an algorithm that, working on a suitably modified suffix-tree data structure, is able to compute, in a fast and efficient way, the coverage of a source sequence S on a target sequence T, taking into account direct and reverse segments, possibly overlapping. Using BpMatch, the operator should define, a priori, the minimum length l of a segment and the minimum number of occurrences minRep, so that only segments longer than l and having a number of occurrences greater than minRep are considered significant. BpMatch outputs the significant segments found and the computed segment-based distance. In the worst case, assuming the alphabet dimension d is a constant, the time required by BpMatch to calculate the coverage is O(l²n). On average, by setting l ≥ 2 log_d(n), the time required to calculate the coverage is only O(n). BpMatch, thanks to the minRep parameter, can also be used to perform a self-covering: to cover a sequence using segments coming from itself, while avoiding the trivial solution of a single segment coincident with the whole sequence. The result of the self-covering approach is a spectral representation of the repeats contained in the sequence. BpMatch is freely available at: www.sourceforge.net/projects/bpmatch/.

  • CSD Homomorphisms between Phylogenetic Networks

    Page(s): 1128 - 1138

    Since Darwin, species trees have been used as a simplified description of the relationships which summarize the complicated network N of reality. Recent evidence of hybridization and lateral gene transfer, however, suggests that there are situations where trees are inadequate. Consequently, it is important to determine properties that characterize networks closely related to N, possibly more complicated than trees but lacking the full complexity of N. A connected surjective digraph (CSD) map is a map f from one network N to another network M such that every arc is either collapsed to a single vertex or taken to an arc, such that f is surjective, and such that the inverse image of a vertex is always connected. CSD maps are shown to behave well under composition. It is proved that if there is a CSD map from N to M, then there is a way to lift an undirected version of M into N, often with added resolution. A CSD map from N to M puts strong constraints on N. In general, it may be useful to study classes of networks such that, for any N, there exists a CSD map from N to some standard member of that class.

  • De Novo Design of Potential RecA Inhibitors Using MultiObjective Optimization

    Page(s): 1139 - 1154

    De novo ligand design involves the optimization of several ligand properties, such as binding affinity, ligand volume, and drug-likeness. Therefore, optimizing these properties independently and simultaneously seems appropriate. In this paper, the ligand design problem is modeled in a multiobjective framework using Archived MultiObjective Simulated Annealing (AMOSA) as the underlying search algorithm. The multiple objectives considered are the energy components, similarity to a known inhibitor, and a novel drug-likeness measure based on Lipinski's rule of five. The RecA protein of Mycobacterium tuberculosis, the causative agent of tuberculosis, is taken as the target for the drug design. To gauge the goodness of the results, they are compared to the outputs of LigBuilder, NEWLEAD, and a variable genetic algorithm (VGA). The same problem has also been modeled using a well-established genetic algorithm-based multiobjective optimization technique, the Nondominated Sorting Genetic Algorithm-II (NSGA-II), to assess the efficacy of AMOSA through comparative analysis. Results demonstrate that while some small molecules designed by the proposed approach are remarkably similar to the known inhibitors of RecA, some new ones are discovered that may be potential candidates for novel lead molecules against tuberculosis.
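
    The Lipinski-based drug-likeness ingredient mentioned above can be sketched with RDKit as a simple violation count; the example SMILES is arbitrary, and the paper's actual measure may combine or weight the criteria differently.

    ```python
    # Count violations of Lipinski's rule of five for a candidate molecule.
    from rdkit import Chem
    from rdkit.Chem import Descriptors, Lipinski

    def lipinski_violations(smiles):
        mol = Chem.MolFromSmiles(smiles)
        rules = [
            Descriptors.MolWt(mol) > 500,        # molecular weight
            Descriptors.MolLogP(mol) > 5,        # octanol-water partition coefficient
            Lipinski.NumHDonors(mol) > 5,        # hydrogen-bond donors
            Lipinski.NumHAcceptors(mol) > 10,    # hydrogen-bond acceptors
        ]
        return sum(rules)

    print(lipinski_violations("CC(=O)Oc1ccccc1C(=O)O"))   # aspirin: 0 violations
    ```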

  • Detection of Outlier Residues for Improving Interface Prediction in Protein Heterocomplexes

    Page(s): 1155 - 1165
    Multimedia
    Save to Project icon | Request Permissions | Click to expandQuick Abstract | PDF file iconPDF (1502 KB)  

    Sequence-based understanding and identification of protein binding interfaces is a challenging research topic due to the complexity of protein systems and the imbalanced distribution between interface and noninterface residues. This paper presents an outlier detection idea to address the redundancy problem in protein interaction data. The cleaned training data are then used for improving the prediction performance. We use three novel measures to describe the extent to which a residue is considered an outlier in comparison to the other residues: the distance of a residue instance from the center instance of all residue instances of the same class label (Dist), the probability of the class label of the residue instance (PCL), and the importance of within-class and between-class (IWB) residue instances. Outlier scores are computed by integrating the three factors; instances with a sufficiently large score are treated as outliers and removed. The data sets without outliers are taken as input for a support vector machine (SVM) ensemble. The proposed SVM ensemble trained on input data without outliers performs better than that with outliers. Our method is also more accurate than many literature methods on benchmark data sets. From our empirical studies, we found that some outlier interface residues are truly near to noninterface regions, and some outlier noninterface residues are close to interface regions.

  • Efficient Approaches for Retrieving Protein Tertiary Structures

    Page(s): 1166 - 1179
    Multimedia

    The 3D conformation of a protein in space is the main factor that determines its function in living organisms. Due to the huge number of newly discovered proteins, there is a need for fast and accurate computational methods for retrieving protein structures. Their purpose is to speed up the process of understanding the structure-to-function relationship, which is crucial in the development of new drugs. There are many algorithms addressing the problem of protein structure retrieval. In this paper, we present several novel approaches for retrieving protein tertiary structures. We present our voxel-based descriptor. Then we present our protein ray-based descriptors, which are applied to the interpolated protein backbone. We introduce five novel wavelet descriptors which perform wavelet transforms on the protein distance matrix. We also propose an efficient algorithm for distance matrix alignment named Matrix Alignment by Sequence Alignment within Sliding Window (MASASW), which has been shown to be much faster than DALI, CE, and MatAlign. We compared our approaches between themselves and with several existing algorithms, and they generally prove to be fast and accurate. MASASW achieves the highest accuracy. The ray- and wavelet-based descriptors, as well as MASASW, are more accurate than CE.


Aims & Scope

This bimonthly journal publishes archival research results on the algorithmic, mathematical, statistical, and computational methods that are central to bioinformatics and computational biology.

Full Aims & Scope

Meet Our Editors

Editor-in-Chief
Ying Xu
University of Georgia
xyn@bmb.uga.edu

Associate Editor-in-Chief
Dong Xu
University of Missouri
xudong@missouri.edu